Abstract: This paper develops a model that addresses
sentence embedding, a hot topic in current natural language
processing research, using recurrent neural networks
(RNN) with Long Short-Term Memory (LSTM) cells. The 
proposed LSTM-RNN model sequentially takes each word 
in a sentence, extracts its information, and embeds it into 
a semantic vector. Due to its ability to capture long term 
memory, the LSTM-RNN accumulates increasingly richer 
information as it goes through the sentence, and when it 
reaches the last word, the hidden layer of the network 
provides a semantic representation of the whole sentence. 
In this paper, the LSTM-RNN is trained in a weakly 
supervised manner on user click-through data logged by a 
commercial web search engine. Visualization and analysis 
are performed to understand how the embedding process 
works. The model is found to automatically attenuate the 
unimportant words and detect the salient keywords in
the sentence. Furthermore, these detected keywords are 
found to automatically activate different cells of the LSTM- 
RNN, where words belonging to a similar topic activate the 
same cell. As a semantic representation of the sentence, 
the embedding vector can be used in many different 
applications. These automatic keyword detection and topic 
allocation abilities enabled by the LSTM-RNN allow the 
network to perform document retrieval, a difficult language 
processing task, where the similarity between the query and 
documents can be measured by the distance between their 
corresponding sentence embedding vectors computed by 
the LSTM-RNN. On a web search task, the LSTM-RNN 
embedding is shown to significantly outperform several 
existing state-of-the-art methods. We emphasize that the
proposed model generates sentence embedding vectors that
are especially useful for web document retrieval tasks. A
comparison with a well known general sentence embedding
method, the Paragraph Vector, is performed. The results
show that the proposed method significantly
outperforms it on the web document retrieval task.

Index Terms: Deep Learning, Long Short-Term Memory,
Sentence Embedding.
Learning a good representation (or features) of
input data is an important task in machine learning. 
In text and language processing, one such problem is 
learning of an embedding vector for a sentence; that is, to 
train a model that can automatically transform a sentence 
to a vector that encodes the semantic meaning of the 
sentence. While word embedding is learned using a 
loss function defined on word pairs, sentence embedding 
is learned using a loss function defined on sentence 
pairs. Sentence embedding usually takes into account the
relationship among the words in the sentence, i.e., the
context information. Therefore, sentence embedding
is more suitable for tasks that require computing
semantic similarities between text strings. By mapping
texts into a unified semantic representation, the embedding
vector can be further used for different language
processing applications, such as machine translation [1],
sentiment analysis and information retrieval [3].
In machine translation, recurrent neural networks
(RNN) with Long Short-Term Memory (LSTM) cells, or
the LSTM-RNN, are used to encode an English sentence
into a vector, which contains the semantic meaning of
the input sentence, and then another LSTM-RNN is
used to generate a French (or another target language)
sentence from the vector. The model is trained to best
predict the output sentence. In [2], a paragraph vector
is learned in an unsupervised manner as a distributed
representation of sentences and documents, which is
then used for sentiment analysis. Sentence embedding
can also be applied to information retrieval, where the
contextual information is properly represented by the
vectors in the same space for fuzzy text matching [3].

In this paper, we propose to use an RNN to sequentially
accept each word in a sentence and recurrently map
it into a latent space together with the historical information.
As the RNN reaches the last word in the sentence,
the hidden activations form a natural embedding vector
for the contextual information of the sentence. We further
incorporate the LSTM cells into the RNN model (i.e. the
LSTM-RNN) to address the difficulty of learning long
term memory in RNN. The learning of such a model
is performed in a weakly supervised manner on the
click-through data logged by a commercial web search 
engine. Although manually labelled data are scarce
in machine learning, logged data with limited feedback
signals are massively available due to the widely used
commercial web search engines. Limited feedback information
such as click-through data provides a weak
supervision signal that indicates the semantic similarity
between the text on the query side and the clicked text
on the document side. To exploit such a signal, the
objective of our training is to maximize the similarity
between the two vectors mapped by the LSTM-RNN
from the query and the clicked document, respectively.
Consequently, the learned embedding vectors of the
query and clicked document are specifically useful for
the web document retrieval task.

An important contribution of this paper is to analyse 
the embedding process of the LSTM-RNN by visualizing 
the internal activation behaviours in response to different 
text inputs. We show that the embedding process of the 
learned LSTM-RNN effectively detects the keywords, 
while attenuating less important words, in the sentence 
automatically by switching on and off the gates within 
the LSTM-RNN cells. We further show that different 
cells in the learned model indeed correspond to different
topics, and the keywords associated with a similar
topic activate the same cell unit in the model. As the 
LSTM-RNN reads to the end of the sentence, the topic 
activation accumulates and the hidden vector at the last 
word encodes the rich contextual information of the 
entire sentence. For this reason, a natural application 
of sentence embedding is web search ranking, in which 
the embedding vector from the query can be used to 
match the embedding vectors of the candidate documents 
according to the maximum cosine similarity rule. Evaluated
on a real web document ranking task, our proposed
method significantly outperforms many of the existing
state-of-the-art methods in NDCG scores. Please note
that when we refer to a document in the paper we mean
the title (headline) of the document.

II. Related Work

Inspired by the word embedding method [4], [5], the
authors in [2] proposed an unsupervised learning method
to learn a paragraph vector as a distributed representation
of sentences and documents, which is then used for
sentiment analysis with superior performance. However,
the model is not designed to capture the fine-grained
sentence structure. In [6], an unsupervised sentence
embedding method is proposed that performs well
on large corpora of contiguous text, e.g., the
BookCorpus [7]. The main idea is to encode the sentence
s(t) and then decode the previous and next sentences, i.e.,
s(t-1) and s(t+1), using two separate decoders. The encoder
and decoders are RNNs with Gated Recurrent Units
(GRU) [8]. However, this sentence embedding method
is not designed for the document retrieval task, in which
supervision comes from queries and their clicked and unclicked
documents. In [9], a Semi-Supervised Recursive Autoencoder
(RAE) is proposed and used for sentiment
prediction. Similar to our proposed method, it does not
need any language-specific sentiment parsers. A greedy
approximation method is proposed to construct a tree
structure for the input sentence. It assigns a vector per
word, which can become practically problematic for large
vocabularies. It also works both on unlabeled data and
supervised sentiment data.

Similar to the recurrent models in this paper, the
DSSM [3] and CLSM [10] models, developed for information
retrieval, can also be interpreted as sentence
embedding methods. However, DSSM treats the input
sentence as a bag-of-words and does not model word
dependencies explicitly. CLSM treats a sentence as a bag
of n-grams, where n is defined by a window, and can
capture local word dependencies. Then a max-pooling
layer is used to form a global feature vector. Methods in
[11] are also convolution-based networks for Natural
Language Processing (NLP). These models, by design,
cannot capture long distance dependencies, i.e., dependencies
among words belonging to non-overlapping n-grams.
In [12] a Dynamic Convolutional Neural Network
(DCNN) is proposed for sentence embedding. Similar to
CLSM, DCNN does not rely on a parse tree and is easily
applicable to any language. However, different from
CLSM where a regular max-pooling is used, in DCNN a
dynamic k-max-pooling is used. This means that instead
of just keeping the largest entries among word vectors in
one vector, the k largest entries are kept in k different vectors.
DCNN has shown good performance in sentiment
prediction and question type classification tasks. In [13],
a convolutional neural network architecture is proposed
for sentence matching. It has shown great performance in
several matching tasks. In [14], Bilingually-constrained
Recursive Auto-encoders (BRAE) are proposed to create
semantic vector representations for phrases. Experiments
show that the proposed method performs well
in two end-to-end SMT tasks.

Long short-term memory networks were developed
in [15] to address the difficulty of capturing long term
memory in RNN. They have been successfully applied to
speech recognition, where they achieve state-of-the-art
performance [16], [17]. In text analysis, LSTM-RNN treats a
sentence as a sequence of words with internal structures,
i.e., word dependencies. It encodes a semantic vector of
a sentence incrementally, which differs from DSSM and
CLSM. The encoding process is performed left-to-right,
word-by-word. At each time step, a new word is encoded
into the semantic vector, and the word dependencies
embedded in the vector are “updated”. When the process
reaches the end of the sentence, the semantic vector has
embedded all the words and their dependencies, and hence
can be viewed as a feature vector representation of the
whole sentence. In the machine translation work [1], an
input English sentence is converted into a vector representation
using LSTM-RNN, and then another LSTM-RNN
is used to generate an output French sentence.
The model is trained to maximize the probability of
predicting the correct output sentence. In [18], there are
two main composition models, the ADD model that is a bag
of words and the BI model that is a summation over bi-gram
pairs plus a non-linearity. In our proposed model, instead
of simple summation, we have used an LSTM model with
letter tri-grams which keeps valuable information over
long intervals (for long sentences) and throws away useless
information. In [19], an encoder-decoder approach is
proposed to jointly learn to align and translate sentences
from English to French using RNNs. The concept of
“attention” in the decoder, discussed in that paper, is
closely related to how our proposed model extracts
keywords in the document side. For further explanations
please see section V-A2. In [20] a set of visualizations
is presented for RNNs with and without LSTM cells
and GRUs. Different from our work where the target task
is sentence embedding for document retrieval, the target
tasks in [20] were character level sequence modelling for
text characters and source code. Interesting observations
about interpretability of some LSTM cells and statistics
of gate activations are presented. In section V-A we
show that some of the results of our visualization are
consistent with the observations reported in [20]. We
also present more detailed visualization specific to the
document retrieval task using click-through data. We also
present visualizations about how our proposed model can
be used for keyword detection.


Different from the aforementioned studies, the method
developed in this paper trains the model so that sentences
that are paraphrases of each other are close in their
semantic embedding vectors (see the description in
Sec. IV further ahead). Another reason that LSTM-RNN
is particularly effective for sentence embedding is its
robustness to noise. For example, in the web document
ranking task, the noise comes from two sources: (i) not
every word in a query / document is equally important,
and we only want to “remember” salient words using
the limited “memory”; (ii) a word or phrase that is
important to a document may not be relevant to a
given query, and we only want to “remember” related
words that are useful to compute the relevance of the
document for a given query. We will illustrate the robustness
of LSTM-RNN in this paper. The structure of LSTM-RNN
also circumvents the serious limitation of using
a fixed window size in CLSM. Our experiments show
that this difference leads to significantly better results in
the web document retrieval task. Furthermore, it has other
advantages. It allows us to capture keywords and key
topics effectively. The models in this paper also do not
need the extra max-pooling layer, as required by the
CLSM, to capture global contextual information and they
do so more effectively.

III. Sentence Embedding Using RNNs with and
without LSTM Cells

In this section, we introduce the model of recurrent 
neural networks and its long short-term memory version 
for learning the sentence embedding vectors. We start 
with the basic RNN and then proceed to LSTM-RNN. 

A. The basic version of RNN 

The RNN is a type of deep neural network that
is “deep” in the temporal dimension and it has been used
extensively in time sequence modelling [21], [22], [23],
[24], [25], [26], [27], [28], [29]. The main idea of using
RNN for sentence embedding is to find a dense and
low dimensional semantic representation by sequentially
and recurrently processing each word in a sentence and
mapping it into a low dimensional vector. In this model,
the global contextual features of the whole text will be
in the semantic representation of the last word in the
text sequence (see Figure 1), where x(t) is the t-th
word, coded as a 1-hot vector, $\mathbf{W}_h$ is a fixed hashing
operator similar to the one used in [3] that converts the
word vector to a letter tri-gram vector, $\mathbf{W}$ is the input
weight matrix, $\mathbf{W}_{rec}$ is the recurrent weight matrix, y(t)
is the hidden activation vector of the RNN, which can be
used as a semantic representation of the t-th word, and
y(t) associated to the last word x(m) is the semantic
representation vector of the entire sentence. Note that
this is very different from the approach in [3] where the
bag-of-words representation is used for the whole text
and no context information is used. This is also different
from [10] where a sliding window of a fixed size (akin
to an FIR filter) is used to capture local features and a
max-pooling layer on the top captures global features.
In the RNN there is neither a fixed-sized window nor
a max-pooling layer; rather the recurrence is used to
capture the context information in the sequence (akin
to an IIR filter).

Fig. 1. The basic architecture of the RNN for sentence embedding,
where temporal recurrence is used to model the contextual information
across words in the text string. The hidden activation vector
corresponding to the last word is the sentence embedding vector (blue).

The mathematical formulation of the above RNN
model for sentence embedding can be expressed as

$$
\begin{aligned}
\mathbf{l}(t) &= \mathbf{W}_h \mathbf{x}(t) \\
\mathbf{y}(t) &= f(\mathbf{W}\mathbf{l}(t) + \mathbf{W}_{rec}\mathbf{y}(t-1) + \mathbf{b})
\end{aligned}
\tag{1}
$$

where $\mathbf{W}$ and $\mathbf{W}_{rec}$ are the input and recurrent matrices
to be learned, $\mathbf{W}_h$ is a fixed word hashing operator, $\mathbf{b}$
is the bias vector and $f(\cdot)$ is assumed to be $\tanh(\cdot)$.
Note that the architecture proposed here for sentence
embedding is slightly different from the traditional RNN in
that there is a word hashing layer that converts the high
dimensional input into a relatively lower dimensional
letter tri-gram representation. There is also no per word
supervision during training; instead, the whole sentence
has a label. This is explained in more detail in section IV.
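
As an illustration of Eq. (1), the following is a minimal pure-Python sketch of the forward pass with a letter tri-gram hashing layer. The `#` word-boundary marker, the toy tri-gram vocabulary built from the sentence itself, the zero initial state y(0), the random weights and all function names are assumptions made for this example, not details given in the text.

```python
import math
import random

def letter_trigrams(word):
    """Split a word into letter tri-grams, e.g. 'cat' -> ['#ca', 'cat', 'at#']."""
    padded = "#" + word + "#"  # '#' boundary marker is an assumption
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def hash_word(word, vocab):
    """l(t) = W_h x(t): tri-gram count vector over a fixed vocabulary."""
    vec = [0.0] * len(vocab)
    for tri in letter_trigrams(word):
        if tri in vocab:
            vec[vocab[tri]] += 1.0
    return vec

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_embed(words, vocab, W, W_rec, b):
    """Apply Eq. (1) word by word and return y(m) for the last word."""
    y = [0.0] * len(b)  # y(0) = 0 is an assumed initial state
    for word in words:
        l = hash_word(word, vocab)
        pre = [a + r + c for a, r, c in zip(matvec(W, l), matvec(W_rec, y), b)]
        y = [math.tanh(p) for p in pre]
    return y

def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights, standing in for learned parameters."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

# Toy usage: the vocabulary is built from the sentence itself (illustrative only).
sentence = "hotels in shanghai".split()
trigrams = sorted({t for w in sentence for t in letter_trigrams(w)})
vocab = {t: i for i, t in enumerate(trigrams)}
rng = random.Random(0)
hidden = 4
W = random_matrix(hidden, len(vocab), rng)
W_rec = random_matrix(hidden, hidden, rng)
b = [0.0] * hidden
embedding = rnn_embed(sentence, vocab, W, W_rec, b)
```

Only the hidden vector after the last word is returned, mirroring the use of y(m) as the sentence embedding.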
B. The RNN with LSTM cells

Although RNN performs the transformation from the
sentence to a vector in a principled manner, it is generally
difficult to learn the long term dependency within the
sequence due to the vanishing gradients problem. One of
the effective solutions for this problem in RNNs is
using memory cells instead of neurons, originally proposed
in [15] as Long Short-Term Memory (LSTM) and
completed in [30] and [31] by adding the forget gate and
peephole connections to the architecture.

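We use the architecture of LSTM illustrated in Fig. 2
for the proposed sentence embedding method. In this
figure, i(t), f(t), o(t), c(t) are the input gate, forget gate,
output gate and cell state vector respectively, $\mathbf{W}_{p1}$, $\mathbf{W}_{p2}$
and $\mathbf{W}_{p3}$ are peephole connections, $\mathbf{W}_i$, $\mathbf{W}_{reci}$ and $\mathbf{b}_i$,
i = 1, 2, 3, 4 are input connections, recurrent connections
and bias values, respectively, $g(\cdot)$ and $h(\cdot)$ are the $\tanh(\cdot)$
function and $\sigma(\cdot)$ is the sigmoid function. We use this
architecture to find y for each word, then use the y(m)
corresponding to the last word in the sentence as the
semantic vector for the entire sentence.

Fig. 2. The basic LSTM architecture used for sentence embedding

Considering Fig. 2, the forward pass for the LSTM-RNN
model is as follows:

$$
\begin{aligned}
\mathbf{y}_g(t) &= g(\mathbf{W}_4\mathbf{l}(t) + \mathbf{W}_{rec4}\mathbf{y}(t-1) + \mathbf{b}_4) \\
\mathbf{i}(t) &= \sigma(\mathbf{W}_3\mathbf{l}(t) + \mathbf{W}_{rec3}\mathbf{y}(t-1) + \mathbf{W}_{p3}\mathbf{c}(t-1) + \mathbf{b}_3) \\
\mathbf{f}(t) &= \sigma(\mathbf{W}_2\mathbf{l}(t) + \mathbf{W}_{rec2}\mathbf{y}(t-1) + \mathbf{W}_{p2}\mathbf{c}(t-1) + \mathbf{b}_2) \\
\mathbf{c}(t) &= \mathbf{f}(t) \circ \mathbf{c}(t-1) + \mathbf{i}(t) \circ \mathbf{y}_g(t) \\
\mathbf{o}(t) &= \sigma(\mathbf{W}_1\mathbf{l}(t) + \mathbf{W}_{rec1}\mathbf{y}(t-1) + \mathbf{W}_{p1}\mathbf{c}(t) + \mathbf{b}_1) \\
\mathbf{y}(t) &= \mathbf{o}(t) \circ h(\mathbf{c}(t))
\end{aligned}
\tag{2}
$$

where $\circ$ denotes the Hadamard (element-wise) product. A
diagram of the proposed model with more details is
presented in section VI of Supplementary Materials.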
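
The forward pass in Eq. (2) can be sketched in a few lines of pure Python. This is only an illustrative sketch: the parameter dictionary layout, the helper names, and the treatment of the peephole connections as full matrices are assumptions of this example.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def add(*vecs):
    """Element-wise sum of several equally sized vectors."""
    return [sum(t) for t in zip(*vecs)]

def hadamard(a, b):
    """Element-wise (Hadamard) product."""
    return [x * y for x, y in zip(a, b)]

def make_params(n_in, n_cell, rng=None, scale=0.1):
    """Build the parameter dict; all zeros unless a random generator is given."""
    def mat(rows, cols):
        if rng is None:
            return [[0.0] * cols for _ in range(rows)]
        return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]
    p = {}
    for k in "1234":
        p["W" + k] = mat(n_cell, n_in)
        p["Wrec" + k] = mat(n_cell, n_cell)
        p["b" + k] = [0.0] * n_cell
    for k in "123":
        p["Wp" + k] = mat(n_cell, n_cell)  # peepholes as full matrices (assumption)
    return p

def gate(act, p, k, l, y_prev, c_peep=None):
    """act(W_k l + W_reck y(t-1) [+ W_pk c] + b_k)."""
    pre = add(matvec(p["W" + k], l), matvec(p["Wrec" + k], y_prev), p["b" + k])
    if c_peep is not None:
        pre = add(pre, matvec(p["Wp" + k], c_peep))
    return [act(v) for v in pre]

def lstm_step(l, y_prev, c_prev, p):
    """One time step of Eq. (2); returns (y(t), c(t))."""
    y_g = gate(math.tanh, p, "4", l, y_prev)
    i = gate(sigmoid, p, "3", l, y_prev, c_prev)
    f = gate(sigmoid, p, "2", l, y_prev, c_prev)
    c = add(hadamard(f, c_prev), hadamard(i, y_g))
    o = gate(sigmoid, p, "1", l, y_prev, c)  # the output gate peeks at c(t)
    y = hadamard(o, [math.tanh(v) for v in c])
    return y, c

# Toy usage: run three random input vectors through a 2-cell LSTM.
rng = random.Random(0)
params = make_params(n_in=3, n_cell=2, rng=rng)
y_t, c_t = [0.0, 0.0], [0.0, 0.0]
for _ in range(3):
    l_t = [rng.uniform(-1.0, 1.0) for _ in range(3)]
    y_t, c_t = lstm_step(l_t, y_t, c_t, params)
```

After the loop, `y_t` plays the role of y(m), the semantic vector for the whole toy sequence.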

IV. Learning Method 

To learn a good semantic representation of the input
sentence, our objective is to make the embedding vectors
for sentences of similar meaning as close as possible,
and meanwhile, to make sentences of different meanings
as far apart as possible. This is challenging in practice
since it is hard to collect a large amount of manually
labelled data that give the semantic similarity signal
between different sentences. Nevertheless, widely
used commercial web search engines are able to log
massive amounts of data with some limited user feedback
signals. For example, given a particular query, the click-through
information about the user-clicked document
among many candidates is usually recorded and can be
used as a weak (binary) supervision signal to indicate
the semantic similarity between two sentences (on the
query side and the document side). In this section, we
explain how to leverage such a weak supervision signal
to learn a sentence embedding vector that achieves the
aforementioned training objective. Please also note that
the above objective to make sentences with similar meaning
as close as possible is similar to machine translation
tasks where two sentences belong to two different languages
with similar meanings and we want to make their
semantic representations as close as possible.

Fig. 3. The click-through signal can be used as a (binary) indication
of the semantic similarity between the sentence on the query side and
the sentence on the document side. The negative samples are randomly
sampled from the training data.

We now describe how to train the model to achieve the
above objective using the click-through data logged by a
commercial search engine. For a complete description of
the click-through data please refer to section 2 in [32].
To begin with, we adopt the cosine similarity between
the semantic vectors of two sentences as a measure for
their similarity:

$$
R(Q, D) = \frac{\mathbf{y}_Q(T_Q)^T \mathbf{y}_D(T_D)}{\|\mathbf{y}_Q(T_Q)\| \cdot \|\mathbf{y}_D(T_D)\|}
\tag{3}
$$

where $T_Q$ and $T_D$ are the lengths of the sentence
Q and sentence D, respectively. In the context of
training over click-through data, we will use Q and
D to denote “query” and “document”, respectively.
In Figure 3, we show the sentence embedding vectors
corresponding to the query, $\mathbf{y}_Q(T_Q)$, and all the
documents, $\{\mathbf{y}_{D^+}(T_{D^+}), \mathbf{y}_{D_1^-}(T_{D_1^-}), \ldots, \mathbf{y}_{D_n^-}(T_{D_n^-})\}$,
where the subscript $D^+$ denotes the (clicked) positive
sample among the documents, and the subscript $D_j^-$
denotes the j-th (un-clicked) negative sample. All these
embedding vectors are generated by feeding the sentences
into the RNN or LSTM-RNN model described
in Sec. III and taking the y corresponding to the last word
(see the blue box in Figure 1).

We want to maximize the likelihood of the clicked
document given query, which can be formulated as the
following optimization problem:

$$
L(\Lambda) = \min_{\Lambda} \left\{ -\log \prod_{r=1}^{N} P(D_r^+ | Q_r) \right\} = \min_{\Lambda} \sum_{r=1}^{N} l_r(\Lambda)
\tag{4}
$$

where $\Lambda$ denotes the collection of the model parameters;
in the regular RNN case, it includes $\mathbf{W}_{rec}$ and $\mathbf{W}$ in Figure 1,
and in the LSTM-RNN case, it includes $\mathbf{W}_1$, $\mathbf{W}_2$, $\mathbf{W}_3$,
$\mathbf{W}_4$, $\mathbf{W}_{rec1}$, $\mathbf{W}_{rec2}$, $\mathbf{W}_{rec3}$, $\mathbf{W}_{rec4}$,
$\mathbf{W}_{p1}$, $\mathbf{W}_{p2}$, $\mathbf{W}_{p3}$,
$\mathbf{b}_1$, $\mathbf{b}_2$, $\mathbf{b}_3$ and $\mathbf{b}_4$ in Figure 2. $D_r^+$ is the clicked
document for the r-th query, $P(D_r^+|Q_r)$ is the probability
of the clicked document given the r-th query, N is the number
of query / clicked-document pairs in the corpus and

$$
l_r(\Lambda) = -\log\left(\frac{e^{\gamma R(Q_r, D_r^+)}}{e^{\gamma R(Q_r, D_r^+)} + \sum_{j=1}^{n} e^{\gamma R(Q_r, D_{r,j}^-)}}\right)
= \log\left(1 + \sum_{j=1}^{n} e^{-\gamma \cdot \Delta_{r,j}}\right)
\tag{5}
$$

where $\Delta_{r,j} = R(Q_r, D_r^+) - R(Q_r, D_{r,j}^-)$, $R(\cdot,\cdot)$ was
defined earlier in (3), $D_{r,j}^-$ is the j-th negative candidate
document for the r-th query and n denotes the number of
negative samples used during training.

The expression in (5) is a logistic loss over $\Delta_{r,j}$.
It upper-bounds the pairwise accuracy, i.e., the 0-1
loss. Since the similarity measure is the cosine function,
$\Delta_{r,j} \in [-2, 2]$. To have a larger range for $\Delta_{r,j}$, we use
$\gamma$ for scaling. It helps to penalize the prediction error
more. Its value is set empirically by experiments on a
held out dataset.

To train the RNN and LSTM-RNN, we use Back Propagation
Through Time (BPTT). The update equations for
parameter $\Lambda$ at epoch k are as follows:

$$
\begin{aligned}
\Delta\Lambda_k &= \Lambda_k - \Lambda_{k-1} \\
\Delta\Lambda_k &= \mu_{k-1}\Delta\Lambda_{k-1} - \epsilon_{k-1}\nabla L(\Lambda_{k-1} + \mu_{k-1}\Delta\Lambda_{k-1})
\end{aligned}
\tag{6}
$$

where $\nabla L(\cdot)$ is the gradient of the cost function in (4),
$\epsilon$ is the learning rate and $\mu_k$ is a momentum parameter
determined by the scheduling scheme used for training.
The above equations are equivalent to the Nesterov method
in [33]. To see why, please refer to appendix A.1 of
[34] where the Nesterov method is derived as a momentum
method. The gradient of the cost function, $\nabla L(\Lambda)$, is:

$$
\nabla L(\Lambda) = -\sum_{r=1}^{N}\sum_{j=1}^{n}\underbrace{\sum_{\tau=0}^{T} \alpha_{r,j}\frac{\partial \Delta_{r,j,\tau}}{\partial \Lambda}}_{\text{one large update}}
\tag{7}
$$

where T is the number of time steps that we unfold the
network over time and

$$
\alpha_{r,j} = \frac{-\gamma e^{-\gamma\Delta_{r,j}}}{1 + \sum_{j=1}^{n} e^{-\gamma\Delta_{r,j}}}
\tag{8}
$$

The terms $\partial \Delta_{r,j,\tau} / \partial \Lambda$ in (7) and the error signals for different parameters
of RNN and LSTM-RNN that are necessary for
training are presented in Appendix A. Full derivation of
gradients in both models is presented in section III of
supplementary materials.
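
To make the quantities in (3), (5) and (8) concrete, the sketch below computes the cosine similarity, both forms of the per-query loss, and the coefficients $\alpha_{r,j}$ for one toy query. The vectors and the value of the scaling factor `gamma` are made up for illustration; the text only says that $\gamma$ is tuned on a held out dataset.

```python
import math

def cosine(u, v):
    """R(Q, D) in Eq. (3); undefined if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def loss_softmax_form(r_pos, r_negs, gamma):
    """First form of Eq. (5): negative log of the softmax over scaled similarities."""
    num = math.exp(gamma * r_pos)
    den = num + sum(math.exp(gamma * r) for r in r_negs)
    return -math.log(num / den)

def loss_logistic_form(r_pos, r_negs, gamma):
    """Second form of Eq. (5): logistic loss over Delta_{r,j}."""
    return math.log(1.0 + sum(math.exp(-gamma * (r_pos - r)) for r in r_negs))

def alphas(r_pos, r_negs, gamma):
    """alpha_{r,j} of Eq. (8), one value per negative sample."""
    exps = [math.exp(-gamma * (r_pos - r)) for r in r_negs]
    denom = 1.0 + sum(exps)
    return [-gamma * e / denom for e in exps]

# Toy usage with made-up embedding vectors (illustrative values only).
gamma = 10.0
y_query = [0.9, 0.1, 0.3]
y_clicked = [0.8, 0.2, 0.4]
y_unclicked = [[-0.5, 0.7, 0.1], [0.1, -0.9, 0.2]]
r_pos = cosine(y_query, y_clicked)
r_negs = [cosine(y_query, d) for d in y_unclicked]
loss = loss_logistic_form(r_pos, r_negs, gamma)
```

The two loss functions return the same value up to rounding, which is the equality stated in (5).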

To accelerate training by parallelization, we use minibatch
training and one large update instead of incremental
updates during back propagation through time. To
resolve the gradient explosion problem we use gradient
Algorithm 1 Training LSTM-RNN for Sentence Embedding

Inputs: Fixed step size “$\epsilon$”, Scheduling for “$\mu$”, Gradient clip threshold
“$th_G$”, Maximum number of Epochs “$nEpoch$”, Total number of query
/ clicked-document pairs “$N$”, Total number of un-clicked (negative) documents
for a given query “$n$”, Maximum sequence length for truncated BPTT
“$T$”.

Outputs: Two trained models, one in query side “$\Lambda_Q$”, one in document
side “$\Lambda_D$”.

Initialization: Set all parameters in $\Lambda_Q$ and $\Lambda_D$ to small random numbers,
$i = 0$, $k = 1$.

procedure LSTM-RNN($\Lambda_Q$, $\Lambda_D$)
  while $i < nEpoch$ do
    for “first minibatch” → “last minibatch” do
      $r \leftarrow 1$
      while $r \le N$ do
        for $j = 1 \rightarrow n$ do
          Compute $\alpha_{r,j}$    ▷ use (8)
          Compute $\sum_{\tau=0}^{T} \alpha_{r,j} \, \partial\Delta_{r,j,\tau} / \partial\Lambda_{k,Q}$    ▷ use the error signals in Appendix A
          Compute $\sum_{\tau=0}^{T} \alpha_{r,j} \, \partial\Delta_{r,j,\tau} / \partial\Lambda_{k,D}$    ▷ use the error signals in Appendix A
          sum above terms for Q and D over j
        end for
        sum above terms for Q and D over r
        $r \leftarrow r + 1$
      end while
      Compute $\nabla L(\Lambda_{k,Q})$    ▷ use (7)
      Compute $\nabla L(\Lambda_{k,D})$    ▷ use (7)
      if $\|\nabla L(\Lambda_{k,Q})\| > th_G$ then
        $\nabla L(\Lambda_{k,Q}) \leftarrow th_G \cdot \nabla L(\Lambda_{k,Q}) / \|\nabla L(\Lambda_{k,Q})\|$
      end if
      if $\|\nabla L(\Lambda_{k,D})\| > th_G$ then
        $\nabla L(\Lambda_{k,D}) \leftarrow th_G \cdot \nabla L(\Lambda_{k,D}) / \|\nabla L(\Lambda_{k,D})\|$
      end if
      Compute $\Delta\Lambda_{k,Q}$    ▷ use (6)
      Compute $\Delta\Lambda_{k,D}$    ▷ use (6)
      Update: $\Lambda_{k,Q} \leftarrow \Delta\Lambda_{k,Q} + \Lambda_{k-1,Q}$
      Update: $\Lambda_{k,D} \leftarrow \Delta\Lambda_{k,D} + \Lambda_{k-1,D}$
      $k \leftarrow k + 1$
    end for
    $i \leftarrow i + 1$
  end while
end procedure


re-normalization method described in lEH, dl- To 
accelerate the convergence, we use the Nesterov method [33]
and found it effective in training both RNN and LSTM-RNN
for sentence embedding.
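
The combination of gradient re-normalization and the Nesterov update in (6) can be sketched as follows. The quadratic objective, the step size, the momentum value and the clipping threshold are placeholders chosen for this example; in the paper the gradient comes from BPTT over the click-through loss.

```python
import math

def clip_gradient(grad, threshold):
    """Rescale the gradient to norm `threshold` when its norm exceeds it."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        return [threshold * g / norm for g in grad]
    return list(grad)

def nesterov_step(params, delta_prev, grad_fn, eps, mu, threshold):
    """One update of Eq. (6): the gradient is taken at the look-ahead point."""
    lookahead = [p + mu * d for p, d in zip(params, delta_prev)]
    grad = clip_gradient(grad_fn(lookahead), threshold)
    delta = [mu * d - eps * g for d, g in zip(delta_prev, grad)]
    new_params = [p + d for p, d in zip(params, delta)]
    return new_params, delta

def toy_gradient(x):
    """Gradient of the toy objective sum(x_i^2), standing in for the BPTT gradient."""
    return [2.0 * v for v in x]

# Toy usage: minimize sum(x_i^2) starting from (3, -4).
params = [3.0, -4.0]
delta = [0.0, 0.0]
for _ in range(300):
    params, delta = nesterov_step(params, delta, toy_gradient,
                                  eps=0.1, mu=0.9, threshold=1.0)
```

Clipping bounds the size of each step while the gradient is large, and the momentum term then carries the iterate toward the minimum once the gradient falls below the threshold.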

We have used a simple yet effective scheduling for
$\mu_k$ for both RNN and LSTM-RNN models: in the first
and last 2% of all parameter updates $\mu_k = 0.9$ and for
the other 96% of all parameter updates $\mu_k = 0.995$. We
have used a fixed step size for training RNN and a fixed
step size for training LSTM-RNN.
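
The momentum schedule above is simple enough to write down directly. In this sketch the update index is 0-based and the 2% boundary is rounded down to a whole number of updates; both choices are assumptions, as the text does not specify them.

```python
def momentum_schedule(k, total_updates):
    """Return mu_k for the k-th parameter update (0-based index)."""
    edge = (2 * total_updates) // 100  # size of the first and last 2% (rounded down)
    if k < edge or k >= total_updates - edge:
        return 0.9
    return 0.995

# Toy usage: the schedule for a run of 1000 parameter updates.
schedule = [momentum_schedule(k, 1000) for k in range(1000)]
```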

A summary of the training method for LSTM-RNN is
presented in Algorithm 1.

V. Analysis of the Sentence Embedding 
Process and Performance Evaluation 

To understand how the LSTM-RNN performs sentence
embedding, we use visualization tools to analyze the
semantic vectors generated by our model. We would
like to answer the following questions: (i) How are
word dependencies and context information captured?
(ii) How does LSTM-RNN attenuate unimportant information
and detect critical information from the input
sentence? Or, how are the keywords embedded into the
semantic vector? (iii) How are the global topics identified
by LSTM-RNN?


To answer these questions, we train the RNN with
and without LSTM cells on the click-through dataset
which is logged by a commercial web search engine.
The training method has been described in Sec. IV.
Description of the corpus is as follows. The training set
includes 200,000 positive query / document pairs where
only the clicked signal is used as a weak supervision for
training LSTM. The relevance judgement set (test set)
is constructed as follows. First, the queries are sampled
from a year of search engine logs. Adult, spam, and
bot queries are all removed. Queries are de-duped so
that only unique queries remain. To reflect a natural
query distribution, we do not try to control the quality
of these queries. For example, in our query sets, there
are around 20% misspelled queries, and around 20%
navigational queries and 10% transactional queries, etc.
Second, for each query, we collect Web documents to
be judged by issuing the query to several popular search
engines (e.g., Google, Bing) and fetching top-10 retrieval
results from each. Finally, the query-document pairs are
judged by a group of well-trained assessors. In this
study all the queries are preprocessed as follows. The
text is white-space tokenized and lower-cased, numbers
are retained, and no stemming/inflection treatment is
performed. Unless stated otherwise, in the experiments
we used 4 negative samples, i.e., n = 4 in Fig. 3.
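
The preprocessing and negative sampling described above can be sketched as follows. The function names are hypothetical, and excluding the clicked document from the negative candidates is an assumption of this example; the text only states that negatives are randomly sampled from the training data.

```python
import random

def preprocess(text):
    """Lower-case and split on white space; numbers are kept, no stemming."""
    return text.lower().split()

def sample_negatives(clicked, documents, n=4, rng=None):
    """Draw n un-clicked documents at random for one query.

    Raises ValueError if fewer than n candidates are available.
    """
    rng = rng or random.Random()
    candidates = [d for d in documents if d != clicked]  # exclusion is an assumption
    return rng.sample(candidates, n)

# Toy usage with made-up document titles.
titles = [
    "shanghai hotels accommodation hotel in shanghai discount and reservation",
    "cheap flights to beijing",
    "weather forecast for paris",
    "python tutorial for beginners",
    "best pizza recipes",
    "used cars for sale",
]
query_tokens = preprocess("Hotels in Shanghai")
negatives = sample_negatives(titles[0], titles, n=4, rng=random.Random(0))
```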


We now proceed to perform a comprehensive analysis 
by visualizing the trained RNN and LSTM-RNN models. 
In particular, we will visualize the on-and-off behaviors 
of the input gates, output gates, cell states, and the 
semantic vectors in LSTM-RNN model, which reveals 
how the model extracts useful information from the input 
sentence and embeds it properly into the semantic vector 
according to the topic information. 


Although we gave the full learning formula for all
the model parameters in the previous section, we
remove the peephole connections and the forget gate
from the LSTM-RNN model in the current task. This
is because the length of each sequence, i.e., the number
of words in a query or a document, is known in advance,
and we set the state of each cell to zero at the beginning
of a new sequence. Therefore, forget gates are not of great
help here. Also, as long as the order of words is kept, the
precise timing in the sequence is not of great concern.
Therefore, peephole connections are not that important
either. Removing the peephole connections and the forget gate
also reduces the training time, since a
smaller number of parameters need to be learned.
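
For reference, one way to write the simplified forward pass is given below. It assumes that removing the forget gate amounts to fixing $\mathbf{f}(t)$ to one and that removing the peepholes drops the $\mathbf{W}_{p}$ terms from (2); this reading is an interpretation and is not spelled out in the text.

```latex
\begin{aligned}
\mathbf{y}_g(t) &= g(\mathbf{W}_4\mathbf{l}(t) + \mathbf{W}_{rec4}\mathbf{y}(t-1) + \mathbf{b}_4) \\
\mathbf{i}(t) &= \sigma(\mathbf{W}_3\mathbf{l}(t) + \mathbf{W}_{rec3}\mathbf{y}(t-1) + \mathbf{b}_3) \\
\mathbf{c}(t) &= \mathbf{c}(t-1) + \mathbf{i}(t) \circ \mathbf{y}_g(t) \\
\mathbf{o}(t) &= \sigma(\mathbf{W}_1\mathbf{l}(t) + \mathbf{W}_{rec1}\mathbf{y}(t-1) + \mathbf{b}_1) \\
\mathbf{y}(t) &= \mathbf{o}(t) \circ h(\mathbf{c}(t))
\end{aligned}
```

Under this reading, the parameters $\mathbf{W}_2$, $\mathbf{W}_{rec2}$, $\mathbf{b}_2$ and the three peephole matrices no longer need to be learned.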
(c) o(t)    (d) y(t)

Fig. 4. Query: "hotels in shanghai". Since the sentence ends at
the third word, all the values to the right of it are zero (green color).

A. Analysis 

In this section we would like to examine how the information
in the input sentence is sequentially extracted
and embedded into the semantic vector over time by the
LSTM-RNN model.

1) Attenuating Unimportant Information: First, we
examine the evolution of the semantic vector and how
unimportant words are attenuated. Specifically, we feed
the following input sentences from the test dataset into
the trained LSTM-RNN model:

• Query: "hotels in shanghai"

• Document: "shanghai hotels accommodation hotel
in shanghai discount and reservation"

Activations of the input gate, output gate, cell state and the
embedding vector for each cell for the query and document
are shown in Fig. 4 and Fig. 5, respectively. The vertical
axis is the cell index from 1 to 32, and the horizontal
axis is the word index from 1 to 10 numbered from left
to right in a sequence of words and color codes show
activation values. From Figs. 4 and 5, we make the following
observations:

• Semantic representation y(t) and cell states c(t) are
evolving over time. Valuable context information is
gradually absorbed into c(t) and y(t), so that the
information in these two vectors becomes richer
over time, and the semantic information of the
entire input sentence is embedded into vector y(t),
which is obtained by applying output gates to the
cell states c(t).

• The input gates evolve in such a way that they
attenuate the unimportant information and detect
the important information from the input
sentence. For example, in Fig. 5(a), most of
the input gate values corresponding to word 3,
word 7 and word 9 have very small values
(light green-yellow color)^1, which correspond to
the words "accommodation", "discount" and
"reservation", respectively, in the document sentence.
Interestingly, input gates reduce the effect of
these three words in the final semantic representation,
y(t), such that the semantic similarity between
sentences from query and document sides is not
affected by these words.

Fig. 5. Document: "shanghai hotels accommodation hotel
in shanghai discount and reservation". Since the sentence ends
at the ninth word, all the values to the right of it are zero (green color).

2) Keywords Extraction: In this section, we show
how the trained LSTM-RNN extracts the important information,
i.e., keywords, from the input sentences. To
this end, we backtrack semantic representations, y(t),
over time. We focus on the 10 most active cells in the
final semantic representation. Whenever there is a large
enough change in cell activation value (y(t)), we assume
an important keyword has been detected by the model.
We illustrate the result using the above example ("hotels
in shanghai"). The evolution of the activations, y(t), of the
10 most active cells over time is shown in Fig. 6 for the
query and the document sentences. From Fig. 6, we also
observe that different words activate different cells. In
Tables I and II, we show the number of cells each word

If this is not clearly visible, please refer to Fig. 1 in section I of
supplementary materials. We have adjusted the color bar for all figures to
have the same range; for this reason the structure might not be clearly
visible. More visualization examples can also be found in section IV
of Supplementary Materials.

Likewise, the vertical axis is the cell index and the horizontal axis is
the word index in the sentence.




































TABLE II

Key words for document: "shanghai hotels accommodation hotel in shanghai discount and reservation"

Number of assigned cells out of 10

Word             Left to Right    Right to Left
shanghai         -                4
hotels           4                6
accommodation    3                5
hotel            8                4
in               1                5
shanghai         8                1
discount         5                7
and              3                5
reservation      4                -



(a) y(t) top 10 for query 


(b) y(t) top 10 for document 


Fig. 6. Activation values, y(t), of 10 most active cells for Query: 
"hotels in shanghai" and Document: "shanghai hotels accommodation 
hotel in shanghai discount and reservation" 


TABLE I

Key words for query: "hotels in shanghai"

Number of assigned cells out of 10

Word        Left to Right    Right to Left
hotels      -                6
in          0                0
shanghai    1                -



activates. We used a bidirectional LSTM-RNN to get the
results of these tables: the "Left to Right" counts come
from an LSTM-RNN that reads sentences from left to right,
and the "Right to Left" counts from one that reads them
from right to left. In these tables, we
labelled a word as a keyword if more than 40% of the top
10 active cells in both directions declare it as a keyword.
The boldface numbers in the tables show that the number
of cells assigned to that word is more than 4, i.e., 40%
of the top 10 active cells. From the tables, we observe that
the keywords activate more cells than the unimportant
words, meaning that they are selectively embedded into
the semantic vector.
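
The counting procedure described above can be sketched as follows. This is a hedged illustration rather than the authors' code: the text only says that a "large enough change" in y(t) marks a keyword, so the `threshold` value, the ranking of cells by the magnitude of their final activation, and the function name are all assumptions.

```python
def count_changed_cells(activations, top_k=10, threshold=0.1):
    """Count, per word, how many of the most active cells change noticeably.

    activations[c][t] holds y(t) of cell c after reading word t (0-based),
    trimmed to the sentence length. `threshold` is a hypothetical value.
    """
    # Rank cells by the magnitude of their final activation (assumption).
    ranked = sorted(range(len(activations)),
                    key=lambda c: abs(activations[c][-1]),
                    reverse=True)
    top_cells = ranked[:top_k]

    # The first word is skipped: activations start at zero, so every cell
    # would register a change there (see the footnote on the first word).
    counts = [None]
    for t in range(1, len(activations[0])):
        changed = sum(
            1 for c in top_cells
            if abs(activations[c][t] - activations[c][t - 1]) > threshold
        )
        counts.append(changed)
    return counts


# Toy example with three cells and three words (hypothetical numbers).
toy = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 0.9],
    [0.1, 0.1, 0.1],
]
toy_counts = count_changed_cells(toy, top_k=2, threshold=0.1)
```

Running the same function on the activations of a right-to-left LSTM-RNN would give the second set of counts used in the tables.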

3) Topic Allocation: Now, we further show that the
trained LSTM-RNN model not only detects the keywords,
but also allocates them properly to different cells
according to the topics they belong to. To do this, we go
through the test dataset using the trained LSTM-RNN 
model and search for the keywords that are detected 

Note that before presenting the first word of the sequence, activation
values are initially zero so that there is always a considerable change in 
the cell states after presenting the first word. For this reason, we have 
not indicated the number of cells detecting the first word as a keyword. 
Moreover, another keyword extraction example can be found in section 
IV of supplementary materials. 


by a specific cell. For simplicity, we use the following 
simple approach: for each given query we look into the 
keywords that are extracted by the 5 most active cells 
of LSTM-RNN and list them in Table III. Interestingly,
each cell collects keywords of a specific topic. For 
example, cell 26 in Table III extracts keywords related 
to the topic “food” and cells 2 and 6 mainly focus on 
the keywords related to the topic “health”. 


B. Performance Evaluation 


1) Web Document Retrieval Task: In this section, we 
apply the proposed sentence embedding method to an 
important web document retrieval task for a commercial 
web search engine. Specifically, the RNN models (with 
and without LSTM cells) embed the sentences from the 
query and the document sides into their corresponding 
semantic vectors, and then compute the cosine similarity 
between these vectors to measure the semantic similarity 
between the query and candidate documents. 
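
The scoring step described above can be sketched in a few lines. This is a minimal illustration that assumes the embedding vectors have already been computed and are non-zero; the function names and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted by decreasing similarity to the query."""
    scored = [(cosine(query_vec, d), idx) for idx, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored]


# Toy 2-dimensional embeddings (hypothetical values).
query = [1.0, 0.0]
documents = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
ranking = rank_documents(query, documents)
```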

Experimental results for this task are shown in Table
IV using the standard metric mean Normalized Discounted
Cumulative Gain (NDCG) [36] (the higher the
better) for evaluating the ranking performance of the
RNN and LSTM-RNN on a standalone human-rated test
dataset. We also trained several strong baselines, such as
DSSM [3] and CLSM [10], on the same training dataset
and evaluated their performance on the same task. For
fair comparison, our proposed RNN and LSTM-RNN
models are trained with the same number of parameters
as the DSSM and CLSM models (14.4M parameters).
Besides, we also include in Table IV two well-known
information retrieval (IR) models, BM25 and PLSA, for
the sake of benchmarking. The BM25 model uses the
bag-of-words representation for queries and documents,
which is a state-of-the-art document ranking model based
on term matching, widely used as a baseline in the IR
community. PLSA (Probabilistic Latent Semantic Analysis)
is a topic model proposed in Ea, which is trained 
using maximum a posteriori estimation [38] on
the documents side from the same training dataset. We 
experimented with a varying number of topics from 100 
to 500 for PLSA, which gives similar performance, and 


we report in Table IV the results of using 500 topics.
Results for a language model based method, uni-gram

TABLE III

Keywords assigned to each cell of LSTM-RNN for different queries of two topics, “food” and “health”

Each row lists a query followed by the keywords assigned to each of the 32 cells; cells that are not listed have no assigned keyword for that query.

al yo yo sauce: cell 5: yo; cell 8: sauce; cell 11: sauce
atkins diet lasagna: cell 8: diet; cell 23: diet; cell 30: diet
blender recipes: cell 26: recipes
cake bakery edinburgh: cell 10: bakery; cell 20: bakery; cell 26: bakery
canning corn beef hash: cell 5: beef, hash; cell 26: corn, beef
torre de pizza: cell 26: pizza; cell 29: pizza
famous desserts: cell 7: desserts
fried chicken: cell 3: chicken; cell 7: chicken; cell 26: chicken
smoked turkey recipes: cell 20: turkey; cell 26: recipes
italian sausage hoagies: cell 8: sausage; cell 17: hoagies; cell 22: sausage; cell 26: sausage
do you get allergy: cell 2: allergy
much pain will after total knee replacement: cell 1: pain; cell 6: pain, knee; cell 17: knee; cell 23: replacement
how to make whiter teeth: cell 13: make, teeth; cell 15: to; cell 26: whiter
illini community hospital: cell 2: community, hospital; cell 8: hospital; cell 10: community; cell 22: hospital; cell 30: hospital
implant infection: cell 2: infection; cell 6: infection; cell 25: infection
introductory psychology: cell 2: psychology; cell 6: psychology; cell 28: psychology
narcotics during pregnancy side effects: cell 2: pregnancy; cell 6: pregnancy, effects, during; cell 13: during
fight sinus infections: cell 6: infections; cell 17: sinus, infections; cell 25: infections
health insurance high blood pressure: cell 2: insurance; cell 6: blood; cell 8: high, blood; cell 23: high, pressure; cell 30: insurance, high
all antidepressant medications: cell 2: antidepressant, medications; cell 25: antidepressant; cell 30: medications

language model (ULM) with Dirichlet smoothing, are 
also presented in the table. 
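
For readers unfamiliar with the metric, NDCG@k for a single query can be sketched as below. The text does not spell out the gain and discount functions, so this sketch assumes the common exponential-gain form, (2^rel - 1) / log2(rank + 1), and a hypothetical mapping of the human labels to integers (e.g., Bad = 0, Fair = 1, Good = 2, Excellent = 3). The reported mean NDCG would then be the average over all test queries.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items (rank starts at 1)."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances_in_ranked_order, k):
    """NDCG@k: DCG of the model ranking divided by DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances_in_ranked_order, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant document for this query
    return dcg_at_k(relevances_in_ranked_order, k) / ideal


# Relevance labels of the documents in the order the model ranked them
# (hypothetical example).
perfect = ndcg_at_k([3, 2, 0], k=3)
swapped = ndcg_at_k([2, 3], k=2)
```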

To compare the performance of the proposed method
with general sentence embedding methods in the document
retrieval task, we also performed experiments using two
general sentence embedding methods.

1) In the first experiment, we used the method proposed
in [2] that generates embedding vectors
known as Paragraph Vectors. It is also known as 
doc2vec. It maps each word to a vector and then 
uses the vectors representing all words inside a 
context window to predict the vector representation 
of the next word. The main idea in this method is
to use an additional paragraph token from previous
sentences in the document inside the context
window. This paragraph token is mapped to vector
space using a different matrix from the one used to 
map the words. A primary version of this method 
is known as word2vec proposed in (3^ . The only 
difference is that word2vec does not include the 
paragraph token. 

To use doc2vec on our dataset, we first trained the
doc2vec model on both the train set (about 200,000
query-document pairs) and the test set (about 900,000
query-document pairs). This gives us an embedding
vector for every query and document in the
dataset. We used the following parameters for
training:

• min-count=1 : minimum number of words
per sentence; sentences with fewer words than
this will be ignored. We set it to 1 to make
sure we do not throw away anything.

• window=5 : fixed window size explained in 
fa). We used different window sizes, it re¬ 
sulted in about just 0.4% difference in final 
NDCG values. 

• size=100 : feature vector dimension. We used
400 as well but did not get significantly different
NDCG values.

• sample=1e-4 : this is the down-sampling ratio
for the words that are repeated a lot in the corpus.

• negative=5 : the number of noise words, i.e., 
words used for negative sampling as explained 
in (3. 

• We used 30 epochs of training. We ran an experiment
with 100 epochs but did not observe
much difference in the results.

• We used gensim [40] to perform experiments.
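
The settings listed above could be passed to gensim roughly as follows. This is a sketch under assumptions, not the authors' script: it targets gensim 4.x, where the embedding dimension is spelled `vector_size` rather than `size`; the helper names are hypothetical; and the tokenization and tagging scheme are placeholders. Note also that gensim documents `min_count` as a threshold on a word's total frequency in the corpus.

```python
# Settings as reported in the text; `size` is the name used there.
PAPER_SETTINGS = {
    "min_count": 1,
    "window": 5,
    "size": 100,
    "sample": 1e-4,
    "negative": 5,
}
EPOCHS = 30


def gensim_kwargs(settings=PAPER_SETTINGS, epochs=EPOCHS):
    """Translate the reported settings into gensim 4.x keyword arguments."""
    kwargs = dict(settings)  # copy, so the reported settings stay untouched
    kwargs["vector_size"] = kwargs.pop("size")
    kwargs["epochs"] = epochs
    return kwargs


def train_doc2vec(tokenized_texts):
    """Train Doc2Vec on a list of token lists (requires gensim >= 4)."""
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    corpus = [
        TaggedDocument(words=tokens, tags=[idx])
        for idx, tokens in enumerate(tokenized_texts)
    ]
    return Doc2Vec(documents=corpus, **gensim_kwargs())
```

The import of gensim is kept inside `train_doc2vec` so that the parameter mapping can be inspected without the library installed.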

To make sure that a meaningful model is trained,
we used the trained doc2vec model to find the
most similar words to two sample words in our
dataset, e.g., the words “pizza” and “infection”.
The resulting words and corresponding scores are
presented in section V of Supplementary Materials.
As observed from the resulting words,
the trained model is meaningful and can
recognise semantic similarity.

Doc2vec also assigns an embedding vector for
each query and document in our test set. We used
these embedding vectors to calculate the cosine
similarity score between each query-document pair
in the test set. We used these scores to calculate
NDCG values reported in Table IV for the
Doc2Vec model.

Comparing the results of the doc2vec model with
our proposed method for the document retrieval task
shows that the proposed method in this paper
significantly outperforms doc2vec. One reason for
this is that we have used a very general sentence
embedding method, doc2vec, for the document
retrieval task. This experiment shows that it is
not a good idea to use a general sentence embedding
method, and that using a better task-oriented cost
function, like the one proposed in this paper, is
necessary.

2) In the second experiment, we used the Skip-Thought
vectors proposed in O. During training,
the skip-thought method gets a tuple (s(t - 1), s(t), s(t + 1)),
where it encodes the sentence
s(t) using one encoder, and tries to reconstruct
the previous and next sentences, i.e., s(t - 1) and
s(t + 1), using two separate decoders. The model
uses RNNs with Gated Recurrent Units (GRU),
which are shown to perform as well as LSTM.
In the paper, the authors have emphasized that: "Our
model depends on having a training corpus of contiguous
text". Therefore, training it on our training
set, where we barely have more than one sentence
in the query or document title, is not fair. However,
since their model is trained on 11,038 books from 
BookCorpus dataset 171 which includes about 74 
million sentences, we can use the trained model 
as an off-the-shelf sentence embedding method as 
authors have concluded in the conclusion of the 
paper. 

To do this, we downloaded their trained models
and word embeddings (more than 2 GB in size),
available from https://github.com/ryankiros/skip-thoughts.
Then we encoded each
query and its corresponding document title in our
test set as a vector.

We used the combine-skip sentence embedding
method, a vector of size 4800 × 1, which is the
concatenation of a uni-skip, i.e., a unidirectional
encoder resulting in a 2400 × 1 vector, and a bi-skip,
i.e., a bidirectional encoder resulting in a
1200 × 1 vector by the forward encoder and another
1200 × 1 vector by the backward encoder. The authors
have reported their best results with the combine-skip
encoder.

Using the 4800 x 1 embedding vectors for each 
query and document we calculated the scores and 



Fig. 7. LSTM-RNN compared to RNN during training: The vertical
axis is the logarithmic scale of the training cost, L(Λ), in (4). The horizontal
axis is the number of epochs during training.


NDCG for the whole test set which are reported 
in Table IV.

The proposed method in this paper performs
significantly better than the off-the-shelf skip-thought
method for the document retrieval task. Nevertheless,
since we used skip-thought as an off-the-shelf
sentence embedding method, its result
is good. This result also confirms that learning
embedding vectors using a model and cost function
specifically designed for the document retrieval task is
necessary.


As shown in Table IV, the LSTM-RNN significantly
outperforms all these models, and exceeds the best
baseline model (CLSM) by 1.3% in NDCG@1 score,
which is a statistically significant improvement. As we
pointed out in Sec. V-A, such an improvement comes
from the LSTM-RNN’s ability to embed the contextual
and semantic information of the sentences into a finite
dimension vector. In Table IV, we have also presented
the results when different numbers of negative samples,
n, are used. Generally, by increasing n we expect the
performance to improve. This is because more negative
samples result in a more accurate approximation
of the partition function in the training objective. The results of using
the bidirectional LSTM-RNN are also presented in Table
IV. In this model, one LSTM-RNN reads queries and
documents from left to right, and the other LSTM-RNN
reads queries and documents from right to left. Then the
embedding vectors from the left-to-right and right-to-left
LSTM-RNNs are concatenated to compute the cosine
similarity score and NDCG values.


A comparison between the value of the cost function
during training for LSTM-RNN and RNN on the click-through
data is shown in Fig. 7. From this figure,
we conclude that LSTM-RNN is optimizing the cost
function in (4) more effectively. Please note that all
parameters of both models are initialized randomly.

TABLE IV

Comparisons of NDCG performance measures (the higher
the better) of proposed models and a series of baseline
models, where nhid refers to the number of hidden units,
ncell refers to number of cells, win refers to window size,
and n is the number of negative samples which is set to 4
unless otherwise stated. Unless stated otherwise, the
RNN and LSTM-RNN models are chosen to have the same
number of model parameters as the DSSM and CLSM
models: 14.4M, where 1M = 10^6. The boldface numbers
are the best results.

Model                                                        NDCG@1   NDCG@3   NDCG@10
Skip-Thought (off-the-shelf)                                 26.9%    29.7%    36.2%
Doc2Vec                                                      29.1%    31.8%    38.4%
ULM                                                          30.4%    32.7%    38.5%
BM25                                                         30.5%    32.8%    38.8%
PLSA (T=500)                                                 30.8%    33.7%    40.2%
DSSM (nhid = 288/96), 2 Layers                               31.0%    34.4%    41.7%
CLSM (nhid = 288/96, win=1), 2 Layers, 14.4 M parameters     31.8%    35.1%    42.6%
CLSM (nhid = 288/96, win=3), 2 Layers, 43.2 M parameters     32.1%    35.2%    42.7%
CLSM (nhid = 288/96, win=5), 2 Layers, 72 M parameters       32.0%    35.2%    42.6%
RNN (nhid = 288), 1 Layer                                    31.7%    35.0%    42.3%
LSTM-RNN (ncell = 32), 1 Layer, 4.8 M parameters             31.9%    35.5%    42.7%
LSTM-RNN (ncell = 64), 1 Layer, 9.6 M parameters             32.9%    36.3%    43.4%
LSTM-RNN (ncell = 96), 1 Layer, n = 2                        32.6%    36.0%    43.4%
LSTM-RNN (ncell = 96), 1 Layer, n = 4                        33.1%    36.5%    43.6%
LSTM-RNN (ncell = 96), 1 Layer, n = 6                        33.1%    36.6%    43.6%
LSTM-RNN (ncell = 96), 1 Layer, n = 8                        33.1%    36.4%    43.7%
Bidirectional LSTM-RNN (ncell = 96), 1 Layer                 33.2%    36.6%    43.6%


VI. Conclusions and Future Work 

This paper addresses deep sentence embedding. We 
propose a model based on long short-term memory to 
model the long range context information and embed the 
key information of a sentence in one semantic vector. We 
show that the semantic vector evolves over time and only 
takes useful information from any new input. This has 
been made possible by input gates that detect useless 
information and attenuate it. Due to general limitation 
of available human labelled data, we proposed and 
implemented training the model with a weak supervision 
signal using user click-through data of a commercial web 
search engine. 

By performing a detailed analysis on the model, we
showed that: 1) the proposed model is robust to noise,
i.e., it mainly embeds keywords in the final semantic
vector representing the whole sentence, and 2) in the proposed
model, each cell is usually allocated to keywords
from a specific topic. These findings have been supported
using extensive examples. As a concrete sample application
of the proposed sentence embedding method, we
evaluated it on the important language processing task of
web document retrieval. We showed that, for this task,
the proposed method significantly outperforms all existing
state of the art methods.

This work has been motivated by the earlier successes 
of deep learning methods in speech BTl . 1421 . (4^ . 
O, fl31 and in semantic modelling O, HOl, ll46l . 
ia, and it adds further evidence for the effectiveness 
of these methods. Our future work will further extend
the methods to include 1) using the proposed sentence
embedding method for other important language processing
tasks for which we believe sentence embedding
plays a key role, e.g., the question answering task; 2)
exploiting the prior information about the structure of the
different matrices in Fig. to develop a more effective 
cost function and learning method; 3) exploiting an attention
mechanism in the proposed model to improve the
performance and find out which words in the query are
aligned to which words of the document.

Appendix A 

Expressions for the Gradients 

In this appendix we present the final gradient expressions
that are necessary for training the proposed
models. Full derivations of these gradients are presented
in section III of supplementary materials.

A. RNN 

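For the recurrent parameters, $\Lambda = \mathbf{W}_{rec}$ (we have
omitted the $r$ subscript for simplicity):

$$
\frac{\partial \Delta_{j,\tau}}{\partial \mathbf{W}_{rec}} =
\left[ \delta^{D^+}_{y_Q}(t-\tau)\, \mathbf{y}_Q^T(t-\tau-1) + \delta^{D^+}_{y_D}(t-\tau)\, \mathbf{y}_{D^+}^T(t-\tau-1) \right]
- \left[ \delta^{D_j^-}_{y_Q}(t-\tau)\, \mathbf{y}_Q^T(t-\tau-1) + \delta^{D_j^-}_{y_D}(t-\tau)\, \mathbf{y}_{D_j^-}^T(t-\tau-1) \right] \tag{9}
$$

where $D_j^-$ means the $j$-th candidate document that is not
clicked and

$$
\delta_{y_Q}(t-\tau-1) = (1 - \mathbf{y}_Q(t-\tau-1)) \circ (1 + \mathbf{y}_Q(t-\tau-1)) \circ \mathbf{W}_{rec}^T\, \delta_{y_Q}(t-\tau) \tag{10}
$$

and the same as (10) for $\delta_{y_D}(t-\tau-1)$ with $D$ subscript
for the document side model. Please also note that:

$$
\begin{aligned}
\delta_{y_Q}(T_Q) &= (1 - \mathbf{y}_Q(T_Q)) \circ (1 + \mathbf{y}_Q(T_Q)) \circ (b.c.\mathbf{y}_D(T_D) - a.b^3.c.\mathbf{y}_Q(T_Q)), \\
\delta_{y_D}(T_D) &= (1 - \mathbf{y}_D(T_D)) \circ (1 + \mathbf{y}_D(T_D)) \circ (b.c.\mathbf{y}_Q(T_Q) - a.b.c^3.\mathbf{y}_D(T_D))
\end{aligned} \tag{11}
$$

where

$$
a = \mathbf{y}_Q(t = T_Q)^T \mathbf{y}_D(t = T_D), \qquad
b = \frac{1}{\|\mathbf{y}_Q(t = T_Q)\|}, \qquad
c = \frac{1}{\|\mathbf{y}_D(t = T_D)\|} \tag{12}
$$

For the input parameters, $\Lambda = \mathbf{W}$:

$$
\frac{\partial \Delta_{j,\tau}}{\partial \mathbf{W}} =
\left[ \delta^{D^+}_{y_Q}(t-\tau)\, \mathbf{l}_Q^T(t-\tau) + \delta^{D^+}_{y_D}(t-\tau)\, \mathbf{l}_{D^+}^T(t-\tau) \right]
- \left[ \delta^{D_j^-}_{y_Q}(t-\tau)\, \mathbf{l}_Q^T(t-\tau) + \delta^{D_j^-}_{y_D}(t-\tau)\, \mathbf{l}_{D_j^-}^T(t-\tau) \right] \tag{13}
$$

A full derivation of BPTT for RNN is presented in
section III of supplementary materials.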
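
The vector $b.c.\mathbf{y}_D - a.b^3.c.\mathbf{y}_Q$ appearing in the error signal is the gradient of the cosine similarity $R = a.b.c$ with respect to $\mathbf{y}_Q$. The following is a small numerical sanity check of that expression, offered as a sketch rather than as part of the training code; the function names and test vectors are hypothetical, and the factor (1 - y)(1 + y) assumes the tanh activation used in the derivation.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def cosine(yq, yd):
    """R(Q, D) = a * b * c with a = yq^T yd, b = 1/||yq||, c = 1/||yd||."""
    return sum(q * d for q, d in zip(yq, yd)) / (norm(yq) * norm(yd))


def analytic_grad_query(yq, yd):
    """Closed form dR/dyq = b*c*yd - a*b^3*c*yq."""
    a = sum(q * d for q, d in zip(yq, yd))
    b = 1.0 / norm(yq)
    c = 1.0 / norm(yd)
    return [b * c * d - a * b ** 3 * c * q for q, d in zip(yq, yd)]


def numeric_grad_query(yq, yd, eps=1e-6):
    """Central finite differences of the cosine similarity w.r.t. yq."""
    grad = []
    for k in range(len(yq)):
        plus, minus = list(yq), list(yq)
        plus[k] += eps
        minus[k] -= eps
        grad.append((cosine(plus, yd) - cosine(minus, yd)) / (2 * eps))
    return grad


def error_signal_query(yq, yd):
    """Error signal at the last step: (1 - y)(1 + y) times the cosine gradient."""
    return [(1 - q) * (1 + q) * g
            for q, g in zip(yq, analytic_grad_query(yq, yd))]


# Hypothetical final-step outputs of the query and document networks.
y_q = [0.3, -0.5, 0.8]
y_d = [0.1, 0.4, -0.2]
max_gap = max(abs(a - n) for a, n in
              zip(analytic_grad_query(y_q, y_d), numeric_grad_query(y_q, y_d)))
```

A small `max_gap` indicates that the closed form and the finite-difference estimate agree for these inputs.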


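B. LSTM-RNN

Starting with the cost function in (4), we use the Nesterov method to
update the LSTM-RNN model parameters. Here, $\Lambda$
is one of the weight matrices or bias vectors
$\{\mathbf{W}_1, \mathbf{W}_2, \mathbf{W}_3, \mathbf{W}_4, \mathbf{W}_{rec1}, \mathbf{W}_{rec2}, \mathbf{W}_{rec3}, \mathbf{W}_{rec4}, \mathbf{W}_{p1}, \mathbf{W}_{p2}, \mathbf{W}_{p3}, \mathbf{b}_1, \mathbf{b}_2, \mathbf{b}_3, \mathbf{b}_4\}$
in the LSTM-RNN architecture. The general format of the gradient of the
cost function, $\nabla L(\Lambda)$, is the same as for the RNN. By definition
of $\Delta_{r,j}$, we have:

$$
\frac{\partial \Delta_{r,j}}{\partial \Lambda} = \frac{\partial R(Q_r, D_r^+)}{\partial \Lambda} - \frac{\partial R(Q_r, D_{r,j}^-)}{\partial \Lambda} \tag{14}
$$

We omit $r$ and $j$ subscripts for simplicity and present
$\frac{\partial R(Q,D)}{\partial \Lambda}$ for different parameters of each cell of LSTM-RNN
in the following subsections. This will complete
the process of calculating $\nabla L(\Lambda)$, which is then
used to update the LSTM-RNN model parameters.
In the subsequent subsections vectors $\mathbf{v}_Q$ and $\mathbf{v}_D$ are
defined as:

$$
\begin{aligned}
\mathbf{v}_Q &= (b.c.\mathbf{y}_D(t = T_D) - a.b^3.c.\mathbf{y}_Q(t = T_Q)) \\
\mathbf{v}_D &= (b.c.\mathbf{y}_Q(t = T_Q) - a.b.c^3.\mathbf{y}_D(t = T_D))
\end{aligned} \tag{15}
$$

where $a$, $b$ and $c$ are defined in (12). Full derivation of
truncated BPTT for LSTM-RNN model is presented in
section III of supplementary materials.

1) Output Gate: For recurrent connections we have:

$$
\frac{\partial R(Q,D)}{\partial \mathbf{W}_{rec1}} = \delta^{rec1}_{y_Q}(t).\mathbf{y}_Q(t-1)^T + \delta^{rec1}_{y_D}(t).\mathbf{y}_D(t-1)^T \tag{16}
$$

where

$$
\delta^{rec1}_{y_Q}(t) = \mathbf{o}_Q(t) \circ (1 - \mathbf{o}_Q(t)) \circ h(\mathbf{c}_Q(t)) \circ \mathbf{v}_Q(t) \tag{17}
$$

and the same as (17) for $\delta^{rec1}_{y_D}(t)$ with subscript $D$ for the
document side model. For input connections, $\mathbf{W}_1$, and
peephole connections, $\mathbf{W}_{p1}$, we will have:

$$
\frac{\partial R(Q,D)}{\partial \mathbf{W}_1} = \delta^{rec1}_{y_Q}(t).\mathbf{l}_Q(t)^T + \delta^{rec1}_{y_D}(t).\mathbf{l}_D(t)^T \tag{18}
$$

$$
\frac{\partial R(Q,D)}{\partial \mathbf{W}_{p1}} = \delta^{rec1}_{y_Q}(t).\mathbf{c}_Q(t)^T + \delta^{rec1}_{y_D}(t).\mathbf{c}_D(t)^T \tag{19}
$$

The derivative for output gate bias values will be:

$$
\frac{\partial R(Q,D)}{\partial \mathbf{b}_1} = \delta^{rec1}_{y_Q}(t) + \delta^{rec1}_{y_D}(t) \tag{20}
$$

2) Input Gate: For the recurrent connections we have:

$$
\frac{\partial R(Q,D)}{\partial \mathbf{W}_{rec3}} = diag(\delta^{rec3}_{y_Q}(t)).\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{W}_{rec3}} + diag(\delta^{rec3}_{y_D}(t)).\frac{\partial \mathbf{c}_D(t)}{\partial \mathbf{W}_{rec3}} \tag{21}
$$

where

$$
\begin{aligned}
\delta^{rec3}_{y_Q}(t) &= (1 - h(\mathbf{c}_Q(t))) \circ (1 + h(\mathbf{c}_Q(t))) \circ \mathbf{o}_Q(t) \circ \mathbf{v}_Q(t) \\
\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{W}_{rec3}} &= diag(\mathbf{f}_Q(t)).\frac{\partial \mathbf{c}_Q(t-1)}{\partial \mathbf{W}_{rec3}} + \mathbf{b}_{i,Q}(t).\mathbf{y}_Q(t-1)^T \\
\mathbf{b}_{i,Q}(t) &= \mathbf{y}_{g,Q}(t) \circ \mathbf{i}_Q(t) \circ (1 - \mathbf{i}_Q(t))
\end{aligned} \tag{22}
$$

In equation (21), $\delta^{rec3}_{y_D}(t)$ and $\frac{\partial \mathbf{c}_D(t)}{\partial \mathbf{W}_{rec3}}$ are the same as
(22) with $D$ subscript. For the input connections we will
have the following:

$$
\frac{\partial R(Q,D)}{\partial \mathbf{W}_3} = diag(\delta^{rec3}_{y_Q}(t)).\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{W}_3} + diag(\delta^{rec3}_{y_D}(t)).\frac{\partial \mathbf{c}_D(t)}{\partial \mathbf{W}_3} \tag{23}
$$

where

$$
\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{W}_3} = diag(\mathbf{f}_Q(t)).\frac{\partial \mathbf{c}_Q(t-1)}{\partial \mathbf{W}_3} + \mathbf{b}_{i,Q}(t).\mathbf{x}_Q(t)^T \tag{24}
$$

For the peephole connections we will have:

$$
\frac{\partial R(Q,D)}{\partial \mathbf{W}_{p3}} = diag(\delta^{rec3}_{y_Q}(t)).\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{W}_{p3}} + diag(\delta^{rec3}_{y_D}(t)).\frac{\partial \mathbf{c}_D(t)}{\partial \mathbf{W}_{p3}} \tag{25}
$$

where

$$
\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{W}_{p3}} = diag(\mathbf{f}_Q(t)).\frac{\partial \mathbf{c}_Q(t-1)}{\partial \mathbf{W}_{p3}} + \mathbf{b}_{i,Q}(t).\mathbf{c}_Q(t-1)^T \tag{26}
$$

For bias values, $\mathbf{b}_3$, we will have:

$$
\frac{\partial R(Q,D)}{\partial \mathbf{b}_3} = diag(\delta^{rec3}_{y_Q}(t)).\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{b}_3} + diag(\delta^{rec3}_{y_D}(t)).\frac{\partial \mathbf{c}_D(t)}{\partial \mathbf{b}_3} \tag{27}
$$

where

$$
\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{b}_3} = diag(\mathbf{f}_Q(t)).\frac{\partial \mathbf{c}_Q(t-1)}{\partial \mathbf{b}_3} + \mathbf{b}_{i,Q}(t) \tag{28}
$$

3) Forget Gate: For the recurrent connections we will
have:

$$
\frac{\partial R(Q,D)}{\partial \mathbf{W}_{rec2}} = diag(\delta^{rec2}_{y_Q}(t)).\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{W}_{rec2}} + diag(\delta^{rec2}_{y_D}(t)).\frac{\partial \mathbf{c}_D(t)}{\partial \mathbf{W}_{rec2}} \tag{29}
$$

where

$$
\begin{aligned}
\delta^{rec2}_{y_Q}(t) &= (1 - h(\mathbf{c}_Q(t))) \circ (1 + h(\mathbf{c}_Q(t))) \circ \mathbf{o}_Q(t) \circ \mathbf{v}_Q(t) \\
\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{W}_{rec2}} &= diag(\mathbf{f}_Q(t)).\frac{\partial \mathbf{c}_Q(t-1)}{\partial \mathbf{W}_{rec2}} + \mathbf{b}_{f,Q}(t).\mathbf{y}_Q(t-1)^T \\
\mathbf{b}_{f,Q}(t) &= \mathbf{c}_Q(t-1) \circ \mathbf{f}_Q(t) \circ (1 - \mathbf{f}_Q(t))
\end{aligned} \tag{30}
$$

For input connections to forget gate we will have:

$$
\frac{\partial R(Q,D)}{\partial \mathbf{W}_2} = diag(\delta^{rec2}_{y_Q}(t)).\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{W}_2} + diag(\delta^{rec2}_{y_D}(t)).\frac{\partial \mathbf{c}_D(t)}{\partial \mathbf{W}_2} \tag{31}
$$

where

$$
\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{W}_2} = diag(\mathbf{f}_Q(t)).\frac{\partial \mathbf{c}_Q(t-1)}{\partial \mathbf{W}_2} + \mathbf{b}_{f,Q}(t).\mathbf{x}_Q(t)^T \tag{32}
$$

For peephole connections we have:

$$
\frac{\partial R(Q,D)}{\partial \mathbf{W}_{p2}} = diag(\delta^{rec2}_{y_Q}(t)).\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{W}_{p2}} + diag(\delta^{rec2}_{y_D}(t)).\frac{\partial \mathbf{c}_D(t)}{\partial \mathbf{W}_{p2}} \tag{33}
$$

where

$$
\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{W}_{p2}} = diag(\mathbf{f}_Q(t)).\frac{\partial \mathbf{c}_Q(t-1)}{\partial \mathbf{W}_{p2}} + \mathbf{b}_{f,Q}(t).\mathbf{c}_Q(t-1)^T \tag{34}
$$

For forget gate’s bias values we will have:

$$
\frac{\partial R(Q,D)}{\partial \mathbf{b}_2} = diag(\delta^{rec2}_{y_Q}(t)).\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{b}_2} + diag(\delta^{rec2}_{y_D}(t)).\frac{\partial \mathbf{c}_D(t)}{\partial \mathbf{b}_2} \tag{35}
$$

where

$$
\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{b}_2} = diag(\mathbf{f}_Q(t)).\frac{\partial \mathbf{c}_Q(t-1)}{\partial \mathbf{b}_2} + \mathbf{b}_{f,Q}(t) \tag{36}
$$

4) Input without Gating ($\mathbf{y}_g(t)$): For recurrent connections
we will have:

$$
\frac{\partial R(Q,D)}{\partial \mathbf{W}_{rec4}} = diag(\delta^{rec4}_{y_Q}(t)).\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{W}_{rec4}} + diag(\delta^{rec4}_{y_D}(t)).\frac{\partial \mathbf{c}_D(t)}{\partial \mathbf{W}_{rec4}} \tag{37}
$$

where

$$
\begin{aligned}
\delta^{rec4}_{y_Q}(t) &= (1 - h(\mathbf{c}_Q(t))) \circ (1 + h(\mathbf{c}_Q(t))) \circ \mathbf{o}_Q(t) \circ \mathbf{v}_Q(t) \\
\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{W}_{rec4}} &= diag(\mathbf{f}_Q(t)).\frac{\partial \mathbf{c}_Q(t-1)}{\partial \mathbf{W}_{rec4}} + \mathbf{b}_{g,Q}(t).\mathbf{y}_Q(t-1)^T \\
\mathbf{b}_{g,Q}(t) &= \mathbf{i}_Q(t) \circ (1 - \mathbf{y}_{g,Q}(t)) \circ (1 + \mathbf{y}_{g,Q}(t))
\end{aligned} \tag{38}
$$

For input connection we have:

$$
\frac{\partial R(Q,D)}{\partial \mathbf{W}_4} = diag(\delta^{rec4}_{y_Q}(t)).\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{W}_4} + diag(\delta^{rec4}_{y_D}(t)).\frac{\partial \mathbf{c}_D(t)}{\partial \mathbf{W}_4} \tag{39}
$$

where

$$
\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{W}_4} = diag(\mathbf{f}_Q(t)).\frac{\partial \mathbf{c}_Q(t-1)}{\partial \mathbf{W}_4} + \mathbf{b}_{g,Q}(t).\mathbf{x}_Q(t)^T \tag{40}
$$

For bias values we will have:

$$
\frac{\partial R(Q,D)}{\partial \mathbf{b}_4} = diag(\delta^{rec4}_{y_Q}(t)).\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{b}_4} + diag(\delta^{rec4}_{y_D}(t)).\frac{\partial \mathbf{c}_D(t)}{\partial \mathbf{b}_4} \tag{41}
$$

where

$$
\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{b}_4} = diag(\mathbf{f}_Q(t)).\frac{\partial \mathbf{c}_Q(t-1)}{\partial \mathbf{b}_4} + \mathbf{b}_{g,Q}(t) \tag{42}
$$

5) Error signal backpropagation: Error signals are
back propagated through time using the following equations:

$$
\delta^{rec1}_{Q}(t-1) = [\mathbf{o}_Q(t-1) \circ (1 - \mathbf{o}_Q(t-1)) \circ h(\mathbf{c}_Q(t-1))] \circ \mathbf{W}_{rec1}^T.\delta^{rec1}_{Q}(t) \tag{43}
$$

$$
\delta^{rec_i}_{Q}(t-1) = [(1 - h(\mathbf{c}_Q(t-1))) \circ (1 + h(\mathbf{c}_Q(t-1))) \circ \mathbf{o}_Q(t-1)] \circ \mathbf{W}_{rec_i}^T.\delta^{rec_i}_{Q}(t), \quad \text{for } i \in \{2,3,4\} \tag{44}
$$

Fig. 8. Input gate, i(t), for Document: "shanghai hotels accommodation
hotel in shanghai discount and reservation"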

Supplementary 

Material 


Appendix B 

A CLEARER FIGURE FOR INPUT GATE FOR
"hotels in shanghai" EXAMPLE

In this section we present a clearer figure for part
(a) of Fig. 5 that shows the structure of the input gate for
the document side of the "hotels in shanghai" example. As
is clearly visible from this figure, the input gate values
for most of the cells corresponding to word 3, word
7 and word 9 in the document side of the LSTM-RNN are
very small (light green-yellow color). These
correspond to the words "accommodation", "discount"
and "reservation", respectively, in the document title.
Interestingly, input gates are trying to reduce the effect
of these three words in the final representation (y(t))
because the LSTM-RNN model is trained to maximize
the similarity between query and document if they are a
good match.


Appendix C 

A Closer Look at RNNs with and without 
LSTM Cells in Web Document Retrieval Task 

In this section we show further examples to reveal
the advantage of LSTM-RNN sentence embedding compared
to the RNN sentence embedding.

First, we compare the scores assigned by trained RNN 
and LSTM-RNN to our "hotels in shanghai" example. 
On average, each query in our test dataset is associated 
with 15 web documents (URLs). Each query / document 
pair has a relevance label which is human generated. 
These relevance labels are “Bad”, “Fair”, “Good” and 


TABLE V

RNNs WITH & WITHOUT LSTM CELLS FOR THE SAME QUERY:
"hotels in shanghai"

Number of assigned cells (LSTM-RNN) or neurons (RNN) out of 10

Word        LSTM-RNN    RNN
hotels      -           -
in          0           2
shanghai    1           9


“Excellent”. This example is rated as a “Good” match
in the dataset. The score for this pair assigned by RNN
is 0.8165, while the score assigned by LSTM-RNN is
0.9161. Please note that the score is between 0 and
1. This means that the score assigned by LSTM-RNN
corresponds better to the human generated label.

Second, we compare the number of neurons
and cells assigned to each word by RNN and LSTM-RNN,
respectively. To do this, we rely on the 10 most active cells
and neurons in the final semantic vectors in both models.
Results are presented in Table V and Table VI for query
and document, respectively. An interesting observation
is that RNN sometimes assigns neurons to unimportant
words, e.g., 6 neurons are assigned to the word "in" in
Table VI.

As another example we consider the query,
"how to fix bath tub wont turn off". This
example is rated as a “Bad” match in the dataset by
human judges. The score for this pair
assigned by RNN is 0.7016, while the score assigned
by LSTM-RNN is 0.5944. This shows that the score
generated by LSTM-RNN is closer to the human generated
label.

The numbers of neurons and cells assigned to each word
by RNN and LSTM-RNN are presented in Table VII
and Table VIII for query and document. These are out of the
10 most active neurons and cells in the semantic vector
of RNN and LSTM-RNN. Examples of RNN assigning
neurons to unimportant words are 3 neurons to the word
“a” and 4 neurons to the word “you” in Table VIII.


Appendix D 

Derivation of BPTT for RNN and LSTM-RNN 

In this appendix we present the full derivation of the 
gradients for RNN and LSTM-RNN. 


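A. Derivation of BPTT for RNN

From (4) and (5) we have:

$$
\frac{\partial L(\Lambda)}{\partial \Lambda} = \sum_{r=1}^{N} \frac{\partial l_r(\Lambda)}{\partial \Lambda} = -\sum_{r=1}^{N} \sum_{j=1}^{n} \alpha_{r,j} \frac{\partial \Delta_{r,j}}{\partial \Lambda} \tag{45}
$$

where

$$
\alpha_{r,j} = \frac{-\gamma e^{-\gamma \Delta_{r,j}}}{1 + \sum_{j=1}^{n} e^{-\gamma \Delta_{r,j}}} \tag{46}
$$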
TABLE VI

RNNs WITH & WITHOUT LSTM CELLS FOR THE SAME DOCUMENT: "shanghai hotels accommodation hotel in shanghai discount and reservation"

Number of assigned cells (LSTM-RNN) or neurons (RNN) out of 10

Word             LSTM-RNN    RNN
shanghai         -           -
hotels           4           10
accommodation    3           7
hotel            8           9
in               1           6
shanghai         8           8
discount         5           3
and              3           2
reservation      4           6


TABLE VII

RNN VERSUS LSTM-RNN FOR Query: "how to fix bath tub wont turn off"

Number of assigned cells (LSTM-RNN) or neurons (RNN) out of 10

Word    LSTM-RNN    RNN
how     -           -
to      0           1
fix     4           10
bath    7           4
tub     6           6
wont    3           2
turn    5           7
off     0           1


TABLE VIII

RNN VERSUS LSTM-RNN FOR Document: "how do you paint a bathtub and what paint should..."

Number of assigned cells (LSTM-RNN) or neurons (RNN) out of 10

Word              LSTM-RNN    RNN
how               -           -
do                1           1
you               1           4
paint             7           4
a                 0           3
bathtub           9           7
and               2           2
what              3           5
paint             8           4
should you . . .  4           7


and

$$
\Delta_{r,j} = R(Q_r, D_r^+) - R(Q_r, D_{r,j}^-) \tag{47}
$$

We need to find $\frac{\partial \Delta_{r,j}}{\partial \Lambda}$ for input weights and recurrent
weights. We omit the $r$ subscript for simplicity.

1) Recurrent Weights: 


dAj _ dR{Q,D^) dR{Q, ) 


dWr 


dWr 


dWr 


We divide R{D,Q) into three components: 

R{Q,D) = YQit = TQfyoit = To) • 

"-V-" 

a 

1 1 


l|yQ(i = rQ)|| ■ \\yD{t = TD)\\ 


then 


dR{Q,D) da db 

dWrec ^ dWrec' ^ OWrec'^ ^ 


D 


a.b. 


dc 


dWr 


We have 

dyqit = Tq^yoit = TD).b.c 


D 


dWrec 

dyqit = TQ)'^yD{t = TD).b.c dyqit = Tq) 


( 48 ) 


dyQ{t = TQ) ■ dW 

rec 

dyqjt = Tq^ypit = TD).b.c dypjt = Tp) 
dyp{t = Tp) dWrec 

(+ T \ (h X = Tp) 

yq{t = Tq).{b.c) 


dWr 


( 51 ) 


b.c 


( 49 ) 


Since /(■) = tanh{.), using chain rule we have 
dyqjt = Tq) ^ 

w 

[(1 - yQ{t = Tq)) o (1 + yq{t = Tq))]yq{t - 1)^ 

(52) 

and therefore 

D = [b.c.yp{t = Tp) o (1 - yq{t = Tq))o 
(1 + ygh = Tq))]yq{t - 1)^+ 

[b.c.yq{t = Tq) o (1 - yp{t = Tp))o 

{l + yp{t = Tp))]yp{t-l)'^ (53) 


( 50 ) 


To find E we use following basic rule: 

|-||x-a||2= 

dx l|x-a||2 


( 54 ) 
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Therefore 


E = a.c.- 


d 


dWrec 
- a.c.{\\yQ{t = Tq)\\) 


{\\yQ{t = TQ)\\)-^ = 

-2 ^llyQ(^ = rQ)ll 


-a.c.{\\yQ{t = TQ)\\) 


dWrec 

-2 yQit = TQ) dyQ{t = TQ) 
■\\yQ{t = TQ)\\ dWrec 


= -[a.c.b^.yqit = Tq) o (1 - yQ{t = Tq))o 
{ l + yQ{t = TQ))]yQ{t-l) 

F is calculated similar to ( [55l >: 


(55) 


F = —[a.h.c^.yD{t = Td) o (1 — yoit = Td))o 
{l + yD{t = TD))]yD{t-l) (56) 

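Considering (50), (53), (55) and (56) we have:

$$\frac{\partial R(Q,D)}{\partial \mathbf{W}_{rec}} = \delta_{y_Q}(t)\,\mathbf{y}_Q(t-1)^T + \delta_{y_D}(t)\,\mathbf{y}_D(t-1)^T \tag{57}$$

where

$$\begin{aligned}
\delta_{y_Q}(t = T_Q) &= (1 - \mathbf{y}_Q(t = T_Q)) \circ (1 + \mathbf{y}_Q(t = T_Q)) \circ (b.c.\mathbf{y}_D(t = T_D) - a.b^3.c.\mathbf{y}_Q(t = T_Q)), \\
\delta_{y_D}(t = T_D) &= (1 - \mathbf{y}_D(t = T_D)) \circ (1 + \mathbf{y}_D(t = T_D)) \circ (b.c.\mathbf{y}_Q(t = T_Q) - a.b.c^3.\mathbf{y}_D(t = T_D))
\end{aligned} \tag{58}$$

Equation (57) only unfolds the network one time step. To unfold it over the rest of the time steps using backpropagation we have:

$$\begin{aligned}
\delta_{y_Q}(t - \tau - 1) &= (1 - \mathbf{y}_Q(t - \tau - 1)) \circ (1 + \mathbf{y}_Q(t - \tau - 1)) \circ \mathbf{W}_{rec}^T\,\delta_{y_Q}(t - \tau), \\
\delta_{y_D}(t - \tau - 1) &= (1 - \mathbf{y}_D(t - \tau - 1)) \circ (1 + \mathbf{y}_D(t - \tau - 1)) \circ \mathbf{W}_{rec}^T\,\delta_{y_D}(t - \tau)
\end{aligned} \tag{59}$$

where τ is the number of time steps that we unfold the network over time, which is from 0 to T_Q and T_D for queries and documents respectively. Now using (48) we have:

$$\begin{aligned}
\frac{\partial \Delta_{j,\tau}}{\partial \mathbf{W}_{rec}} ={}& [\delta^{D^+}_{y_Q}(t - \tau)\,\mathbf{y}_Q^T(t - \tau - 1) + \delta^{D^+}_{y_D}(t - \tau)\,\mathbf{y}_{D^+}^T(t - \tau - 1)] \\
&- [\delta^{D_j^-}_{y_Q}(t - \tau)\,\mathbf{y}_Q^T(t - \tau - 1) + \delta^{D_j^-}_{y_D}(t - \tau)\,\mathbf{y}_{D_j^-}^T(t - \tau - 1)]
\end{aligned} \tag{60}$$

To calculate the final value of the gradient we should fold back the network over time and use (45). We will have:

$$\frac{\partial L(\Lambda)}{\partial \mathbf{W}_{rec}} = \underbrace{-\sum_{r=1}^{N} \sum_{j=1}^{n} \sum_{\tau=0}^{T} \alpha_{r,j,T_{D,Q}} \frac{\partial \Delta_{r,j,\tau}}{\partial \mathbf{W}_{rec}}}_{\text{one large update}} \tag{61}$$

2) Input Weights: Using a similar procedure we will have the following for input weights:

$$\frac{\partial R(Q,D)}{\partial \mathbf{W}} = \delta_{y_Q}(t - \tau)\,\mathbf{l}_Q(t - \tau)^T + \delta_{y_D}(t - \tau)\,\mathbf{l}_D(t - \tau)^T \tag{62}$$

where

$$\begin{aligned}
\delta_{y_Q}(t - \tau) &= (1 - \mathbf{y}_Q(t - \tau)) \circ (1 + \mathbf{y}_Q(t - \tau)) \circ (b.c.\mathbf{y}_D(t - \tau) - a.b^3.c.\mathbf{y}_Q(t - \tau)), \\
\delta_{y_D}(t - \tau) &= (1 - \mathbf{y}_D(t - \tau)) \circ (1 + \mathbf{y}_D(t - \tau)) \circ (b.c.\mathbf{y}_Q(t - \tau) - a.b.c^3.\mathbf{y}_D(t - \tau))
\end{aligned} \tag{63}$$

Therefore:

$$\begin{aligned}
\frac{\partial \Delta_{j,\tau}}{\partial \mathbf{W}} ={}& [\delta^{D^+}_{y_Q}(t - \tau)\,\mathbf{l}_Q^T(t - \tau) + \delta^{D^+}_{y_D}(t - \tau)\,\mathbf{l}_{D^+}^T(t - \tau)] \\
&- [\delta^{D_j^-}_{y_Q}(t - \tau)\,\mathbf{l}_Q^T(t - \tau) + \delta^{D_j^-}_{y_D}(t - \tau)\,\mathbf{l}_{D_j^-}^T(t - \tau)]
\end{aligned} \tag{64}$$

and therefore:

$$\frac{\partial L(\Lambda)}{\partial \mathbf{W}} = \underbrace{-\sum_{r=1}^{N} \sum_{j=1}^{n} \sum_{\tau=0}^{T} \alpha_{r,j,T_{D,Q}} \frac{\partial \Delta_{r,j,\tau}}{\partial \mathbf{W}}}_{\text{one large update}} \tag{65}$$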
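The recursions (57) to (59) can be sketched in a few lines of NumPy and compared against a finite-difference gradient. This is an illustrative sketch, not the authors' implementation: it assumes a bias-free tanh RNN y(t) = tanh(W l(t) + W_rec y(t-1)) with y(0) = 0 and weights shared between the query and document sides, and the function names (`forward`, `cosine`, `grad_w_rec`) are hypothetical.

```python
import numpy as np

def forward(W, W_rec, L):
    """Run y(t) = tanh(W l(t) + W_rec y(t-1)) with y(0) = 0; return [y(0), ..., y(T)]."""
    ys = [np.zeros(W.shape[0])]
    for l_t in L:
        ys.append(np.tanh(W @ l_t + W_rec @ ys[-1]))
    return ys

def cosine(yq, yd):
    """R(Q, D) = a.b.c with a = yq^T yd, b = 1/||yq||, c = 1/||yd||."""
    return float(yq @ yd / (np.linalg.norm(yq) * np.linalg.norm(yd)))

def grad_w_rec(W, W_rec, LQ, LD):
    """Gradient of R(Q, D) w.r.t. W_rec via the delta recursions (57)-(59)."""
    ysq, ysd = forward(W, W_rec, LQ), forward(W, W_rec, LD)
    yq, yd = ysq[-1], ysd[-1]
    a = yq @ yd
    b, c = 1.0 / np.linalg.norm(yq), 1.0 / np.linalg.norm(yd)
    # Deltas at the last time step, as in (58).
    delta_q = (1 - yq) * (1 + yq) * (b * c * yd - a * b**3 * c * yq)
    delta_d = (1 - yd) * (1 + yd) * (b * c * yq - a * b * c**3 * yd)
    grad = np.zeros_like(W_rec)
    for ys, delta in ((ysq, delta_q), (ysd, delta_d)):
        for t in range(len(ys) - 1, 0, -1):
            grad += np.outer(delta, ys[t - 1])            # delta(t) y(t-1)^T, as in (57)
            delta = (1 - ys[t - 1]) * (1 + ys[t - 1]) * (W_rec.T @ delta)  # (59)
    return grad

# Small random problem: 4 hidden units, 3-dimensional inputs.
rng = np.random.default_rng(0)
W, W_rec = rng.normal(size=(4, 3)) * 0.5, rng.normal(size=(4, 4)) * 0.5
LQ, LD = rng.normal(size=(3, 3)), rng.normal(size=(5, 3))

analytic = grad_w_rec(W, W_rec, LQ, LD)
numeric = np.zeros_like(W_rec)
eps = 1e-6
for i in range(4):
    for j in range(4):
        E = np.zeros_like(W_rec)
        E[i, j] = eps
        r_plus = cosine(forward(W, W_rec + E, LQ)[-1], forward(W, W_rec + E, LD)[-1])
        r_minus = cosine(forward(W, W_rec - E, LQ)[-1], forward(W, W_rec - E, LD)[-1])
        numeric[i, j] = (r_plus - r_minus) / (2 * eps)
max_err = float(np.max(np.abs(analytic - numeric)))
print(max_err)
```

The gradient with respect to the input weights W follows the same loop with `ys[t - 1]` replaced by the input vector at step t, which mirrors (62) to (64).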


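B. Derivation of BPTT for LSTM-RNN

Following the same decomposition as for the RNN, for every parameter Λ in the LSTM-RNN architecture we have:

$$\frac{\partial R(Q,D)}{\partial \Lambda} = \underbrace{\frac{\partial a}{\partial \Lambda} b.c}_{D} + \underbrace{a.c.\frac{\partial b}{\partial \Lambda}}_{E} + \underbrace{a.b.\frac{\partial c}{\partial \Lambda}}_{F} \tag{66}$$

and, as for the RNN, for D we have:

$$D = \mathbf{y}_D(t = T_D).b.c.\frac{\partial \mathbf{y}_Q(t = T_Q)}{\partial \Lambda} + \mathbf{y}_Q(t = T_Q).b.c.\frac{\partial \mathbf{y}_D(t = T_D)}{\partial \Lambda} \tag{67}$$

From (54) and (55) we have:

$$E = -a.c.b^3.\mathbf{y}_Q(t = T_Q)\frac{\partial \mathbf{y}_Q(t = T_Q)}{\partial \Lambda} \tag{68}$$

$$F = -a.b.c^3.\mathbf{y}_D(t = T_D)\frac{\partial \mathbf{y}_D(t = T_D)}{\partial \Lambda} \tag{69}$$

Therefore

$$\frac{\partial R(Q,D)}{\partial \Lambda} = D + E + F = \mathbf{v}_Q \frac{\partial \mathbf{y}_Q(t = T_Q)}{\partial \Lambda} + \mathbf{v}_D \frac{\partial \mathbf{y}_D(t = T_D)}{\partial \Lambda} \tag{70}$$

where

$$\begin{aligned}
\mathbf{v}_Q &= (b.c.\mathbf{y}_D(t = T_D) - a.b^3.c.\mathbf{y}_Q(t = T_Q)) \\
\mathbf{v}_D &= (b.c.\mathbf{y}_Q(t = T_Q) - a.b.c^3.\mathbf{y}_D(t = T_D))
\end{aligned} \tag{71}$$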
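The vectors v_Q and v_D in (71) are the gradients of the cosine similarity R(Q, D) = a.b.c with respect to the final query and document representations. A minimal NumPy sketch that checks this by finite differences follows; the function names are illustrative and not taken from the paper.

```python
import numpy as np

def cosine(yq, yd):
    """R(Q, D) = a.b.c with a = yq^T yd, b = 1/||yq||, c = 1/||yd||."""
    return float(yq @ yd / (np.linalg.norm(yq) * np.linalg.norm(yd)))

def v_vectors(yq, yd):
    """v_Q and v_D as written in (71)."""
    a = yq @ yd
    b = 1.0 / np.linalg.norm(yq)
    c = 1.0 / np.linalg.norm(yd)
    v_q = b * c * yd - a * b**3 * c * yq
    v_d = b * c * yq - a * b * c**3 * yd
    return v_q, v_d

def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        g[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return g

rng = np.random.default_rng(1)
yq, yd = rng.normal(size=6), rng.normal(size=6)
v_q, v_d = v_vectors(yq, yd)
err_q = float(np.max(np.abs(v_q - numeric_grad(lambda v: cosine(v, yd), yq))))
err_d = float(np.max(np.abs(v_d - numeric_grad(lambda v: cosine(yq, v), yd))))
print(err_q, err_d)
```

Because cosine similarity is invariant to rescaling either vector, v_Q is orthogonal to y_Q and v_D is orthogonal to y_D, which gives a quick additional sanity check.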


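1) Output Gate: Since $\alpha \circ \beta = diag(\alpha)\beta = diag(\beta)\alpha$, where $diag(\alpha)$ is a diagonal matrix whose main diagonal entries are the entries of vector $\alpha$, we have:

$$\begin{aligned}
\frac{\partial \mathbf{y}(t)}{\partial \mathbf{W}_{rec1}} &= \frac{\partial}{\partial \mathbf{W}_{rec1}} \left(diag(h(\mathbf{c}(t))).\mathbf{o}(t)\right) \\
&= \frac{\partial\, diag(h(\mathbf{c}(t)))}{\partial \mathbf{W}_{rec1}}.\mathbf{o}(t) + diag(h(\mathbf{c}(t))).\frac{\partial \mathbf{o}(t)}{\partial \mathbf{W}_{rec1}} \\
&= \mathbf{o}(t) \circ (1 - \mathbf{o}(t)) \circ h(\mathbf{c}(t)).\mathbf{y}(t-1)^T
\end{aligned} \tag{72}$$

Substituting (72) in (70) we have:

$$\frac{\partial R(Q,D)}{\partial \mathbf{W}_{rec1}} = \delta^{rec1}_{y_Q}(t).\mathbf{y}_Q(t-1)^T + \delta^{rec1}_{y_D}(t).\mathbf{y}_D(t-1)^T \tag{73}$$

where

$$\begin{aligned}
\delta^{rec1}_{y_Q}(t) &= \mathbf{o}_Q(t) \circ (1 - \mathbf{o}_Q(t)) \circ h(\mathbf{c}_Q(t)) \circ \mathbf{v}_Q(t) \\
\delta^{rec1}_{y_D}(t) &= \mathbf{o}_D(t) \circ (1 - \mathbf{o}_D(t)) \circ h(\mathbf{c}_D(t)) \circ \mathbf{v}_D(t)
\end{aligned} \tag{74}$$

With a similar derivation for $\mathbf{W}_1$ and $\mathbf{W}_{p1}$ we get:

$$\frac{\partial R(Q,D)}{\partial \mathbf{W}_{1}} = \delta^{rec1}_{y_Q}(t).\mathbf{x}_Q(t)^T + \delta^{rec1}_{y_D}(t).\mathbf{x}_D(t)^T \tag{75}$$

$$\frac{\partial R(Q,D)}{\partial \mathbf{W}_{p1}} = \delta^{rec1}_{y_Q}(t).\mathbf{c}_Q(t)^T + \delta^{rec1}_{y_D}(t).\mathbf{c}_D(t)^T \tag{76}$$

For output gate bias values we have:

$$\frac{\partial R(Q,D)}{\partial \mathbf{b}_{1}} = \delta^{rec1}_{y_Q}(t) + \delta^{rec1}_{y_D}(t) \tag{77}$$

2) Input Gate: Similar to the output gate we start with:

$$\begin{aligned}
\frac{\partial \mathbf{y}(t)}{\partial \mathbf{W}_{rec3}} &= \frac{\partial}{\partial \mathbf{W}_{rec3}} \left(diag(\mathbf{o}(t)).h(\mathbf{c}(t))\right) \\
&= \frac{\partial\, diag(\mathbf{o}(t))}{\partial \mathbf{W}_{rec3}}.h(\mathbf{c}(t)) + diag(\mathbf{o}(t)).\frac{\partial h(\mathbf{c}(t))}{\partial \mathbf{W}_{rec3}} \\
&= diag(\mathbf{o}(t)).(1 - h(\mathbf{c}(t))) \circ (1 + h(\mathbf{c}(t)))\frac{\partial \mathbf{c}(t)}{\partial \mathbf{W}_{rec3}}
\end{aligned} \tag{78}$$

To find $\frac{\partial \mathbf{c}(t)}{\partial \mathbf{W}_{rec3}}$, assuming $\mathbf{f}(t) = 1$ (we derive the formulation for $\mathbf{f}(t) \neq 1$ from this simple solution), we have:

$$\begin{aligned}
\mathbf{c}(0) &= 0 \\
\mathbf{c}(1) &= \mathbf{c}(0) + \mathbf{i}(1) \circ \mathbf{y}_g(1) = \mathbf{i}(1) \circ \mathbf{y}_g(1) \\
\mathbf{c}(2) &= \mathbf{c}(1) + \mathbf{i}(2) \circ \mathbf{y}_g(2) \\
&\;\;\vdots \\
\mathbf{c}(t) &= \sum_{k=1}^{t} \mathbf{i}(k) \circ \mathbf{y}_g(k) = \sum_{k=1}^{t} diag(\mathbf{y}_g(k)).\mathbf{i}(k)
\end{aligned} \tag{79}$$

Therefore

$$\frac{\partial \mathbf{c}(t)}{\partial \mathbf{W}_{rec3}} = \sum_{k=1}^{t} \left[\frac{\partial\, diag(\mathbf{y}_g(k))}{\partial \mathbf{W}_{rec3}}.\mathbf{i}(k) + diag(\mathbf{y}_g(k)).\frac{\partial \mathbf{i}(k)}{\partial \mathbf{W}_{rec3}}\right] \tag{80}$$

$$= \sum_{k=1}^{t} diag(\mathbf{y}_g(k)).\mathbf{i}(k) \circ (1 - \mathbf{i}(k)).\mathbf{y}(k-1)^T \tag{81}$$

and

$$\frac{\partial \mathbf{y}(t)}{\partial \mathbf{W}_{rec3}} = \sum_{k=1}^{t} [\underbrace{\mathbf{o}(t) \circ (1 - h(\mathbf{c}(t))) \circ (1 + h(\mathbf{c}(t)))}_{\mathbf{a}(t)} \circ \underbrace{\mathbf{y}_g(k) \circ \mathbf{i}(k) \circ (1 - \mathbf{i}(k))}_{\mathbf{b}(k)}]\,\mathbf{y}(k-1)^T \tag{82}$$

But this is expensive to implement. To resolve it we have:

$$\begin{aligned}
\frac{\partial \mathbf{y}(t)}{\partial \mathbf{W}_{rec3}} &= \sum_{k=1}^{t} [\mathbf{a}(t) \circ \mathbf{b}(k)]\,\mathbf{y}(k-1)^T \\
&= \underbrace{\sum_{k=1}^{t-1} [\mathbf{a}(t) \circ \mathbf{b}(k)]\,\mathbf{y}(k-1)^T}_{\text{expensive part}} + [\mathbf{a}(t) \circ \mathbf{b}(t)]\,\mathbf{y}(t-1)^T \\
&= diag(\mathbf{a}(t)) \underbrace{\sum_{k=1}^{t-1} \mathbf{b}(k).\mathbf{y}(k-1)^T}_{\frac{\partial \mathbf{c}(t-1)}{\partial \mathbf{W}_{rec3}}} + diag(\mathbf{a}(t)).\mathbf{b}(t).\mathbf{y}(t-1)^T
\end{aligned} \tag{83}$$

Therefore

$$\frac{\partial \mathbf{y}(t)}{\partial \mathbf{W}_{rec3}} = [diag(\mathbf{a}(t))]\left[\frac{\partial \mathbf{c}(t-1)}{\partial \mathbf{W}_{rec3}} + \mathbf{b}(t).\mathbf{y}(t-1)^T\right] \tag{84}$$

For $\mathbf{f}(t) \neq 1$ we have

$$\frac{\partial \mathbf{y}(t)}{\partial \mathbf{W}_{rec3}} = [diag(\mathbf{a}(t))]\left[diag(\mathbf{f}(t)).\frac{\partial \mathbf{c}(t-1)}{\partial \mathbf{W}_{rec3}} + \mathbf{b}_i(t).\mathbf{y}(t-1)^T\right] \tag{85}$$

where

$$\begin{aligned}
\mathbf{a}(t) &= \mathbf{o}(t) \circ (1 - h(\mathbf{c}(t))) \circ (1 + h(\mathbf{c}(t))) \\
\mathbf{b}_i(t) &= \mathbf{y}_g(t) \circ \mathbf{i}(t) \circ (1 - \mathbf{i}(t))
\end{aligned} \tag{86}$$

Substituting the above equation in (70) we will have:

$$\frac{\partial R(Q,D)}{\partial \mathbf{W}_{rec3}} = diag(\delta^{rec3}_{y_Q}(t)).\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{W}_{rec3}} + diag(\delta^{rec3}_{y_D}(t)).\frac{\partial \mathbf{c}_D(t)}{\partial \mathbf{W}_{rec3}} \tag{87}$$

where

$$\begin{aligned}
\delta^{rec3}_{y_Q}(t) &= (1 - h(\mathbf{c}_Q(t))) \circ (1 + h(\mathbf{c}_Q(t))) \circ \mathbf{o}_Q(t) \circ \mathbf{v}_Q(t) \\
\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{W}_{rec3}} &= diag(\mathbf{f}_Q(t)).\frac{\partial \mathbf{c}_Q(t-1)}{\partial \mathbf{W}_{rec3}} + \mathbf{b}_{i,Q}(t).\mathbf{y}_Q(t-1)^T \\
\mathbf{b}_{i,Q}(t) &= \mathbf{y}_{g,Q}(t) \circ \mathbf{i}_Q(t) \circ (1 - \mathbf{i}_Q(t))
\end{aligned} \tag{88}$$

In equation (87), $\delta^{rec3}_{y_D}(t)$ and $\frac{\partial \mathbf{c}_D(t)}{\partial \mathbf{W}_{rec3}}$ are the same as (88) with D subscript. Therefore, update equations for $\mathbf{W}_{rec3}$ are (87), (88) for Q and D and (6).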
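The step from (82) to (84) replaces a sum that would be recomputed at every time step with a running accumulator. A minimal NumPy sketch of that equivalence is given below, using random stand-ins for a(t), b(k) and y(k); the names `dy_explicit` and `dc_recursive` are illustrative, and the optional `f` argument corresponds to the forget-gate form of the recursion in (85) and (88).

```python
import numpy as np

def dy_explicit(a_T, b, y):
    """Explicit sum over k of [a(T) o b(k)] y(k-1)^T, as in (82)."""
    T = b.shape[0] - 1
    return sum(np.outer(a_T * b[k], y[k - 1]) for k in range(1, T + 1))

def dc_recursive(b, y, f=None):
    """Running accumulator G(t) = diag(f(t)) G(t-1) + b(t) y(t-1)^T with G(0) = 0.

    b[k] and f[k] hold the values for step k = 1..T (index 0 is unused);
    y[k] holds y(k) for k = 0..T. With f=None the forget gate is fixed to 1.
    """
    T, n = b.shape[0] - 1, b.shape[1]
    G = np.zeros((n, y.shape[1]))
    for t in range(1, T + 1):
        gate = np.ones(n) if f is None else f[t]
        G = gate[:, None] * G + np.outer(b[t], y[t - 1])
    return G

rng = np.random.default_rng(0)
n, T = 4, 6
a_T = rng.normal(size=n)          # stands in for a(t) at the last step
b = rng.normal(size=(T + 1, n))   # stands in for b(k), k = 1..T
y = rng.normal(size=(T + 1, n))   # stands in for y(k), k = 0..T

explicit = dy_explicit(a_T, b, y)
recursive = np.diag(a_T) @ dc_recursive(b, y)   # diag(a(t)) times the accumulator
print(np.allclose(explicit, recursive))
```

The accumulator costs one outer product per time step, whereas recomputing the explicit sum at every step grows quadratically with the sequence length.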

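With a similar procedure for $\mathbf{W}_3$ we will have the following:

$$\frac{\partial R(Q,D)}{\partial \mathbf{W}_{3}} = diag(\delta^{rec3}_{y_Q}(t)).\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{W}_{3}} + diag(\delta^{rec3}_{y_D}(t)).\frac{\partial \mathbf{c}_D(t)}{\partial \mathbf{W}_{3}} \tag{89}$$

where

$$\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{W}_{3}} = diag(\mathbf{f}_Q(t)).\frac{\partial \mathbf{c}_Q(t-1)}{\partial \mathbf{W}_{3}} + \mathbf{b}_{i,Q}(t).\mathbf{x}_Q(t)^T \tag{90}$$

Therefore, update equations for $\mathbf{W}_3$ are (89), (90) for Q and D and (6).

For peephole connections we will have:

$$\frac{\partial R(Q,D)}{\partial \mathbf{W}_{p3}} = diag(\delta^{rec3}_{y_Q}(t)).\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{W}_{p3}} + diag(\delta^{rec3}_{y_D}(t)).\frac{\partial \mathbf{c}_D(t)}{\partial \mathbf{W}_{p3}} \tag{91}$$

where

$$\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{W}_{p3}} = diag(\mathbf{f}_Q(t)).\frac{\partial \mathbf{c}_Q(t-1)}{\partial \mathbf{W}_{p3}} + \mathbf{b}_{i,Q}(t).\mathbf{c}_Q(t-1)^T \tag{92}$$

Hence, update equations for $\mathbf{W}_{p3}$ are (91), (92) for Q and D and (6).

Following a similar derivation for bias values $\mathbf{b}_3$ we will have:

$$\frac{\partial R(Q,D)}{\partial \mathbf{b}_{3}} = diag(\delta^{rec3}_{y_Q}(t)).\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{b}_{3}} + diag(\delta^{rec3}_{y_D}(t)).\frac{\partial \mathbf{c}_D(t)}{\partial \mathbf{b}_{3}} \tag{93}$$

where

$$\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{b}_{3}} = diag(\mathbf{f}_Q(t)).\frac{\partial \mathbf{c}_Q(t-1)}{\partial \mathbf{b}_{3}} + \mathbf{b}_{i,Q}(t) \tag{94}$$

Update equations for $\mathbf{b}_3$ are (93), (94) for Q and D and (6).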

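3) Forget Gate: For the forget gate, with a derivation similar to that of the input gate, we will have

$$\frac{\partial \mathbf{y}(t)}{\partial \mathbf{W}_{rec2}} = [diag(\mathbf{a}(t))]\left[diag(\mathbf{f}(t)).\frac{\partial \mathbf{c}(t-1)}{\partial \mathbf{W}_{rec2}} + \mathbf{b}_f(t).\mathbf{y}(t-1)^T\right] \tag{95}$$

where

$$\begin{aligned}
\mathbf{a}(t) &= \mathbf{o}(t) \circ (1 - h(\mathbf{c}(t))) \circ (1 + h(\mathbf{c}(t))) \\
\mathbf{b}_f(t) &= \mathbf{c}(t-1) \circ \mathbf{f}(t) \circ (1 - \mathbf{f}(t))
\end{aligned} \tag{96}$$

Substituting the above equation in (70) we will have:

$$\frac{\partial R(Q,D)}{\partial \mathbf{W}_{rec2}} = diag(\delta^{rec2}_{y_Q}(t)).\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{W}_{rec2}} + diag(\delta^{rec2}_{y_D}(t)).\frac{\partial \mathbf{c}_D(t)}{\partial \mathbf{W}_{rec2}} \tag{97}$$

where

$$\begin{aligned}
\delta^{rec2}_{y_Q}(t) &= (1 - h(\mathbf{c}_Q(t))) \circ (1 + h(\mathbf{c}_Q(t))) \circ \mathbf{o}_Q(t) \circ \mathbf{v}_Q(t) \\
\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{W}_{rec2}} &= diag(\mathbf{f}_Q(t)).\frac{\partial \mathbf{c}_Q(t-1)}{\partial \mathbf{W}_{rec2}} + \mathbf{b}_{f,Q}(t).\mathbf{y}_Q(t-1)^T \\
\mathbf{b}_{f,Q}(t) &= \mathbf{c}_Q(t-1) \circ \mathbf{f}_Q(t) \circ (1 - \mathbf{f}_Q(t))
\end{aligned} \tag{98}$$

Therefore, update equations for $\mathbf{W}_{rec2}$ are (97), (98) for Q and D and (6).

For input weights to the forget gate, $\mathbf{W}_2$, we have

$$\frac{\partial R(Q,D)}{\partial \mathbf{W}_{2}} = diag(\delta^{rec2}_{y_Q}(t)).\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{W}_{2}} + diag(\delta^{rec2}_{y_D}(t)).\frac{\partial \mathbf{c}_D(t)}{\partial \mathbf{W}_{2}} \tag{99}$$

where

$$\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{W}_{2}} = diag(\mathbf{f}_Q(t)).\frac{\partial \mathbf{c}_Q(t-1)}{\partial \mathbf{W}_{2}} + \mathbf{b}_{f,Q}(t).\mathbf{x}_Q(t)^T \tag{100}$$

Therefore, update equations for $\mathbf{W}_2$ are (99), (100) for Q and D and (6).

For peephole connections, $\mathbf{W}_{p2}$, we have

$$\frac{\partial R(Q,D)}{\partial \mathbf{W}_{p2}} = diag(\delta^{rec2}_{y_Q}(t)).\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{W}_{p2}} + diag(\delta^{rec2}_{y_D}(t)).\frac{\partial \mathbf{c}_D(t)}{\partial \mathbf{W}_{p2}} \tag{101}$$

where

$$\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{W}_{p2}} = diag(\mathbf{f}_Q(t)).\frac{\partial \mathbf{c}_Q(t-1)}{\partial \mathbf{W}_{p2}} + \mathbf{b}_{f,Q}(t).\mathbf{c}_Q(t-1)^T \tag{102}$$

Therefore, update equations for $\mathbf{W}_{p2}$ are (101), (102) for Q and D and (6).

Update equations for forget gate bias values, $\mathbf{b}_2$, will be the following equations and (6):

$$\frac{\partial R(Q,D)}{\partial \mathbf{b}_{2}} = diag(\delta^{rec2}_{y_Q}(t)).\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{b}_{2}} + diag(\delta^{rec2}_{y_D}(t)).\frac{\partial \mathbf{c}_D(t)}{\partial \mathbf{b}_{2}} \tag{103}$$

where

$$\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{b}_{2}} = diag(\mathbf{f}_Q(t)).\frac{\partial \mathbf{c}_Q(t-1)}{\partial \mathbf{b}_{2}} + \mathbf{b}_{f,Q}(t) \tag{104}$$

4) Input without Gating ($\mathbf{y}_g(t)$): Gradients for $\mathbf{y}_g(t)$ parameters are as follows:

$$\frac{\partial \mathbf{y}(t)}{\partial \mathbf{W}_{rec4}} = [diag(\mathbf{a}(t))]\left[diag(\mathbf{f}(t)).\frac{\partial \mathbf{c}(t-1)}{\partial \mathbf{W}_{rec4}} + \mathbf{b}_g(t).\mathbf{y}(t-1)^T\right] \tag{105}$$

where

$$\begin{aligned}
\mathbf{a}(t) &= \mathbf{o}(t) \circ (1 - h(\mathbf{c}(t))) \circ (1 + h(\mathbf{c}(t))) \\
\mathbf{b}_g(t) &= \mathbf{i}(t) \circ (1 - \mathbf{y}_g(t)) \circ (1 + \mathbf{y}_g(t))
\end{aligned} \tag{106}$$

Substituting the above equation in (70) we will have:

$$\frac{\partial R(Q,D)}{\partial \mathbf{W}_{rec4}} = diag(\delta^{rec4}_{y_Q}(t)).\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{W}_{rec4}} + diag(\delta^{rec4}_{y_D}(t)).\frac{\partial \mathbf{c}_D(t)}{\partial \mathbf{W}_{rec4}} \tag{107}$$

where

$$\begin{aligned}
\delta^{rec4}_{y_Q}(t) &= (1 - h(\mathbf{c}_Q(t))) \circ (1 + h(\mathbf{c}_Q(t))) \circ \mathbf{o}_Q(t) \circ \mathbf{v}_Q(t) \\
\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{W}_{rec4}} &= diag(\mathbf{f}_Q(t)).\frac{\partial \mathbf{c}_Q(t-1)}{\partial \mathbf{W}_{rec4}} + \mathbf{b}_{g,Q}(t).\mathbf{y}_Q(t-1)^T \\
\mathbf{b}_{g,Q}(t) &= \mathbf{i}_Q(t) \circ (1 - \mathbf{y}_{g,Q}(t)) \circ (1 + \mathbf{y}_{g,Q}(t))
\end{aligned} \tag{108}$$

Therefore, update equations for $\mathbf{W}_{rec4}$ are (107), (108) for Q and D and (6).

For input weight parameters, $\mathbf{W}_4$, we have

$$\frac{\partial R(Q,D)}{\partial \mathbf{W}_{4}} = diag(\delta^{rec4}_{y_Q}(t)).\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{W}_{4}} + diag(\delta^{rec4}_{y_D}(t)).\frac{\partial \mathbf{c}_D(t)}{\partial \mathbf{W}_{4}} \tag{109}$$

where

$$\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{W}_{4}} = diag(\mathbf{f}_Q(t)).\frac{\partial \mathbf{c}_Q(t-1)}{\partial \mathbf{W}_{4}} + \mathbf{b}_{g,Q}(t).\mathbf{x}_Q(t)^T \tag{110}$$

Therefore, update equations for $\mathbf{W}_4$ are (109), (110) for Q and D and (6).

Gradients with respect to bias values, $\mathbf{b}_4$, are

$$\frac{\partial R(Q,D)}{\partial \mathbf{b}_{4}} = diag(\delta^{rec4}_{y_Q}(t)).\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{b}_{4}} + diag(\delta^{rec4}_{y_D}(t)).\frac{\partial \mathbf{c}_D(t)}{\partial \mathbf{b}_{4}} \tag{111}$$

where

$$\frac{\partial \mathbf{c}_Q(t)}{\partial \mathbf{b}_{4}} = diag(\mathbf{f}_Q(t)).\frac{\partial \mathbf{c}_Q(t-1)}{\partial \mathbf{b}_{4}} + \mathbf{b}_{g,Q}(t) \tag{112}$$

Therefore, update equations for $\mathbf{b}_4$ are (111), (112) for Q and D and (6). There are no peephole connections for $\mathbf{y}_g(t)$.


Appendix E

LSTM-RNN Visualization

In this appendix we present more examples of LSTM-RNN visualization.

A. LSTM-RNN Semantic Vectors: Another Example

Consider the following example from the test dataset:

• Query: "how to fix bath tub wont turn off"

• Document: "how do you paint a bathtub and what paint should you use yahoo answers" (here "yahoo answers" is treated as one word)

Activations of input gate, output gate, cell state and cell output for each cell for query and document are presented in Fig. 9 and Fig. 10 respectively, based on a trained LSTM-RNN model.

Three interesting observations from Fig. 9 and Fig. 10:

• Semantic representation y(t) and cell states c(t) are evolving over time.

• In part (a) of Fig. 10 we observe that input gate values for most of the cells corresponding to word 3, word 4, word 7 and word 9 on the document side of the LSTM-RNN are very small (light blue color). These correspond to the words "you", "paint", "and" and "paint" respectively in the document title. Interestingly, the input gates are trying to reduce the effect of these words in the final representation (y(t)) because the LSTM-RNN model is trained to maximize the similarity between query and document if they are a good match.

• y(t) is used as the semantic representation after applying the output gate on the cell states. Note that valuable context information is stored in the cell states c(t).

B. Key Word Extraction: Another Example

Evolution of the 10 most active cells over time for the second example is presented in Fig. 11 for the query and Fig. 12 for the document. The number of cells, out of the 10 most active cells, assigned to each word is presented in Table IX and Table X.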


Appendix F

Doc2Vec Similarity Test

To make sure that a meaningful model is trained, we used the trained doc2vec model to find the most similar words to two sample words in our dataset, the words "pizza" and "infection". The resulting words and corresponding scores are as follows:

print(model.most_similar('pizza'))

[(u'recipes', 0.9316294193267822),
 (u'recipe', 0.9295548796653748),
 (u'food', 0.9250608682632446),
 (u'restaurants', 0.9223555326461792),
 (u'bar', 0.9191627502441406),
 (u'sabayon', 0.916868269443512),
 (u'page', 0.9160783290863037),
 (u'restaurant', 0.9112323522567749),
 (u'house', 0.9104640483856201),
 (u'the', 0.9103578925132751)]

print(model.most_similar('infection'))

[(u'infections', 0.9698576927185059),
 (u'treatment', 0.9143450856208801),
 (u'symptoms', 0.9138627052307129),
 (u'disease', 0.9100595712661743),
 (u'palpitations', 0.9083651304244995),
 (u'pneumonia', 0.9073051810264587),
 (u'medical', 0.9043352603912354),
 (u'abdomen', 0.9034136533737183),
 (u'medlineplus', 0.9032401442527771),
 (u'gout', 0.9027985334396362)]

As can be observed from the resulting words, the trained model is a meaningful model and can recognise semantic similarity.


Appendix G

Diagram of the proposed model

To clarify the difference between the proposed method and the general sentence embedding methods, in this section we present a diagram illustrating the training procedure of the proposed model. It is presented in Fig. 13. In this figure n is the number of negative (unclicked) documents. The other parameters in this figure are similar to those used in Fig. 2 and Fig. 3 of the paper.


Fig. 9. Query: "how to fix bath tub wont turn off". Panels include (c) o(t) and (d) y(t).


TABLE IX
Keyword extraction for Query: "how to fix bath tub wont turn off"

                                                      how   to   fix   bath   tub   wont   turn   off
Number of assigned cells out of 10 (Left to Right)     -    0     4     1      6     3      5      0
Number of assigned cells out of 10 (Right to Left)     4    1     6     1      6     7      7      -


Fig. 10. Document: "how do you paint a bathtub and what paint should ...". Panels include (c) o(t) and (d) y(t).


TABLE X
Keyword extraction for Document: "how do you paint a bathtub and what paint should ..."

                                                      how   do   you   paint   a   bathtub   and   what   paint   should you ...
Number of assigned cells out of 10 (Left to Right)     -    1     -      7     0      9       2     3       8          4
Number of assigned cells out of 10 (Right to Left)     5    9     5      4     8      4       5     5       9          -


Fig. 12. Document: "how do you paint a bathtub and what paint should ..."


Fig. 13. Architecture of the proposed method. Diagram labels: P(D^+|Q), P(D_1^-|Q), P(D_2^-|Q), ..., P(D_n^-|Q); LSTM cells and gates; semantic representation of the document.













































